ON THE CONVERGENCE TO EQUILIBRIUM OF
KAC'S RANDOM WALK ON MATRICES 

By Roberto Imbuzeiro Oliveira 1 

Instituto de Matemática Pura e Aplicada (IMPA)

We consider Kac's random walk on $n$-dimensional rotation matrices,
where each step is a random rotation in the plane generated by two
randomly picked coordinates. We show that this process converges to the
Haar measure on $SO(n)$ in the $L^2$ transportation cost (Wasserstein)
metric in $O(n^2 \ln n)$ steps. We also prove that our bound is at most
an $O(\ln n)$ factor away from optimal. Previous bounds, due to
Diaconis/Saloff-Coste and Pak/Sidenko, had extra powers of $n$ and held
only for $L^1$ transportation cost.

Our proof method includes a general result of independent interest,
akin to the path coupling method of Bubley and Dyer. Suppose that $P$ is
a Markov chain on a Polish length space $(M, d)$ and that for all
$x, y \in M$ with $d(x, y) \ll 1$ there is a coupling $(X, Y)$ of one
step of $P$ from $x$ and $y$ (resp.) that contracts distances by a
$(\xi + o(1))$ factor on average. Then the map $\mu \mapsto \mu P$ is
$\xi$-contracting in the transportation cost metric.

1. Introduction. Around 50 years ago Kac [7] introduced a one-dimensional
toy model of a Boltzmann gas. It is a discrete-time Markov process whose
state at a time $t \in \{0, 1, 2, 3, \dots\}$ is a vector

$$v(t) = (v_1(t), \dots, v_n(t)) \in \mathbb{R}^n,$$

corresponding to the velocities of $n$ interacting particles of equal mass.
At each time $t$, a uniformly distributed pair $1 \le i_t < j_t \le n$ and a
uniform angle $\theta_t \in [0, 2\pi]$ are chosen independently. This choice
corresponds to a collision between particles $i_t, j_t$ whose velocities are
changed to new values

$$v_{i_t}(t+1) = \cos\theta_t\, v_{i_t}(t) + \sin\theta_t\, v_{j_t}(t),$$

$$v_{j_t}(t+1) = -\sin\theta_t\, v_{i_t}(t) + \cos\theta_t\, v_{j_t}(t),$$

whereas the other velocities are kept the same. This prescription for the new
velocities implies that the total kinetic energy

$$E(t) \equiv \sum_{k=1}^{n} v_k(t)^2$$

is conserved.
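
As an illustration of the collision rule above, the following is a minimal sketch (not taken from the paper) of one step of the velocity process in plain Python. The function names `kac_step` and `energy`, the seed and the test vector are hypothetical choices made only for this example; indices are 0-based.

```python
import math
import random

def kac_step(v, rng):
    """Apply one collision of Kac's walk to the velocity list v, in place."""
    n = len(v)
    # Sorting a uniform sample of two distinct indices gives a uniform pair i < j.
    i, j = sorted(rng.sample(range(n), 2))
    theta = rng.uniform(0.0, 2.0 * math.pi)  # uniform collision angle
    c, s = math.cos(theta), math.sin(theta)
    vi, vj = v[i], v[j]
    v[i] = c * vi + s * vj    # new velocity of particle i
    v[j] = -s * vi + c * vj   # new velocity of particle j
    return i, j, theta

def energy(v):
    """Total kinetic energy E = sum_k v_k^2."""
    return sum(x * x for x in v)

rng = random.Random(0)               # hypothetical seed, for reproducibility
v = [1.0, -2.0, 0.5, 3.0, 0.0]       # hypothetical initial velocities
e0 = energy(v)
for _ in range(1000):
    kac_step(v, rng)
e1 = energy(v)                       # equals e0 up to floating-point error
```

Each step touches only two coordinates, so the update needs a constant amount of extra memory.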

For each time step $t$,

$$v(t+1) = R(i_t, j_t, \theta_t)\, v(t),$$

where $R(i_t, j_t, \theta_t)$ is a rotation by $\theta_t$ of the plane generated
by the coordinates $i_t$ and $j_t$ in $n$-dimensional space. Two related
processes have been studied in the literature under the heading of "Kac's
random walk":

• Suppose $E(0) = 1$. Then the evolution of $v(0), v(1), v(2), v(3), \dots$
corresponds to an ergodic Markov chain over the $(n-1)$-dimensional sphere
$S^{n-1} \subset \mathbb{R}^n$, with uniform invariant distribution. This is the
model originally considered by Kac [7] in his investigations of the
foundations of Statistical Mechanics. See [4, 6, 16] and the references
therein for more works in similar directions.

• One might also consider the random walk on matrices determined by
choosing some $X(0) \in SO(n)$, the set of $n \times n$ rotation matrices, and
then setting

$$X(t+1) = R(i_t, j_t, \theta_t)\, X(t), \qquad t \ge 0.$$

This is a discrete-time ergodic random walk on $SO(n)$ whose stationary
distribution is the Haar measure on $SO(n)$. This process has also been
extensively studied, both for its intrinsic interest and as a sampling
algorithm (indeed, a "Gibbs sampler" [5]) for the Haar measure.
Interestingly, this process is featured in Hastings' seminal 1970 paper on
Markov chain Monte Carlo [8]. See [1, 4, 5, 12] for more details.

The question arises of how fast Kac's random walk on $SO(n)$ converges
to equilibrium. This question may be posed in different forms. Convergence
of density functions to equilibrium is very well understood: Janvresse [6]
obtained the first bound of optimal magnitude $O(n^{-1})$ on the $L^2$ spectral
gap of the chain on $S^{n-1}$. Carlen, Carvalho and Loss [4] obtained the exact
spectral gap for both processes. Finally, Maslen [10] computed the entire
spectrum for both processes.

Convergence to equilibrium in total variation also occurs, as shown by
Diaconis and Saloff-Coste [5], who obtained a very poor $e^{O(n^2)}$ mixing time
bound for convergence in total variation of the matrix process. We cannot
improve on this bound, but note that total variation is perhaps too
stringent a notion of convergence for simulations (as it is sensitive to
errors at arbitrarily small scales), whereas convergence of densities is too
weak (e.g., when one starts from a discrete distribution).

We consider an intermediate notion of convergence to equilibrium based
on transportation cost. Given a metric space $(M, d)$ and two probability
measures $\mu, \nu$ over the Borel $\sigma$-field of $M$, the $L^p$ transportation
cost (or Wasserstein) distance between $\mu$ and $\nu$ is

$$W_{d,p}(\mu, \nu) = \inf\{(\mathbb{E}[d(X, Y)^p])^{1/p} : (X, Y) \text{ is a pair of random variables coupling } (\mu, \nu)\}$$

(see Section 2.2 for a formal definition). Diaconis and Saloff-Coste [5] and
Pak and Sidenko [12] use the dual characterization of $W_{d,1}$ ([15], Remark 6.5)
that is especially relevant for simulations:

$$W_{d,1}(\mu, \nu) = \sup\left\{\int_M f\, d(\mu - \nu) : f : M \to \mathbb{R} \text{ is 1-Lipschitz under } d\right\}. \tag{1}$$

That is, if one can sample from $\mu$, one can estimate $\int_M f\, d\nu$ for any
Lipschitz $f$ up to a $W_{d,1}(\mu, \nu)$ intrinsic bias. This is a natural metric
for many applications; as a case in point, we briefly discuss a suggestion of
Ailon and Chazelle [1]. It is well known that one can "reduce the dimension"
of a point set $S \subset \mathbb{R}^n$ while approximately preserving distances by
first applying a random linear transformation $X$ drawn from the Haar measure
on $SO(n)$ and then projecting onto the first $k$ coordinates. A result known
as the Johnson-Lindenstrauss lemma says that if one chooses
$k = O(\ln|S|/\varepsilon^2)$ (which does not depend on the ambient dimension $n$),
then the ratios of pairwise distances in $S$ are all preserved up to
$(1 \pm \varepsilon)$-factors, with high probability. One can easily check that a
similar result holds when $X$ is sufficiently close to being Haar distributed
in the $W_{d,1}$ metric (for an appropriate metric $d$; see below). As noted in
[1], for $X = X(t)$ coming from Kac's random walk, the products $s_t = X(t)s$
($s \in S$) can be computed with just a constant amount of extra memory, as the
map $s_t \mapsto s_{t+1}$ affects only two coordinates of $s_t$; hence, if we can
prove that $X(t)$ converges rapidly to the Haar measure in the $W_{d,1}$
distance, we have a time- and memory-efficient way of doing dimensionality
reduction.

Our main result is a rapid mixing bound for the $SO(n)$ walk. We consider
$M = SO(n)$ with two different choices of metric $d$. For $a, b \in SO(n)$ we
define:

$\mathrm{hs}(a, b) = \|a - b\|_{\mathrm{hs}} = \sqrt{\mathrm{Tr}((a - b)^\dagger (a - b))}$, the Hilbert-Schmidt norm;
$D(a, b)$ = the Riemannian metric on $SO(n)$ induced by the Hilbert-Schmidt
inner product $\langle u, v \rangle_{\mathrm{hs}} = \mathrm{Tr}(u^\dagger v)$.

Clearly, $\mathrm{hs} \le D$ always. Define the $L^p$ transportation-cost mixing times:

$$\tau_{d,p}(\varepsilon) = \inf\{t \in \mathbb{N} : W_{d,p}(\mu K^t, \mathcal{H}) \le \varepsilon \text{ for all prob. measures } \mu \text{ on } SO(n)\},$$

where $d = D$ or $\mathrm{hs}$; $\mathcal{H}$ is the Haar measure on $SO(n)$; $K$ is the
transition kernel for Kac's walk; and $\mu K^t$ is the time-$t$ distribution of
a walk started from distribution $\mu$. Note that
$\tau_{\mathrm{hs},p}(\cdot) \le \tau_{D,p}(\cdot)$ and that both mixing times are
increasing in $p$. We will show the following:

Theorem 1 (Main result). For all $n \in \mathbb{N} \setminus \{0, 1\}$, Kac's random
walk on $SO(n)$ satisfies the following mixing time estimate:

$$\tau_{D,2}(\varepsilon) \le \left\lceil n^2 \ln\left(\frac{\pi\sqrt{n}}{\varepsilon}\right) \right\rceil.$$

Thus, $O(n^2 \ln n)$ steps of the Markov chain suffice to bring $\mu K^t$
$\varepsilon$-close to the Haar measure $\mathcal{H}$ for any $\varepsilon = n^{-O(1)}$. This
improves upon the $O(n^4 \ln n)$ bound by Diaconis and Saloff-Coste [5] and a
very recent preprint by Pak and Sidenko [12] that lowered the estimate to
$O(n^{2.5} \ln n)$ (we only learned about that result after proving the main
results in the present paper). Moreover, these two papers treated only the
$L^1$ case for $d = \mathrm{hs}$, whereas we consider the stronger $L^2$ case with
the stronger metric $D$.

We also show that our bounds are tight up to an $O(\ln n)$ factor, for all
$n^{-O(1)} \le \varepsilon \le \varepsilon_0$ ($\varepsilon_0$ some constant), even when applied
to $p = 1$ and $d = \mathrm{hs}$.

Theorem 2. There exist $c, \varepsilon_0 > 0$ such that, for all
$n \in \mathbb{N} \setminus \{0, 1\}$,

$$\tau_{\mathrm{hs},1}(\varepsilon_0) \ge c\, n^2.$$

Theorem 2 follows from a general lower bound for the mixing time of 
random walks induced by group actions. The general result might be already 
known, but since we could not find a proof of it elsewhere, we provide our 
own proof in Section 6. The bound in Theorem 2 was also claimed in [12] . 

The key to proving our main result, Theorem 1, is a contraction property
of the Markov transition kernel of the random walk under consideration. Fix
again a metric space $(M, d)$. For $\xi > 0$, say that a Markov transition kernel
$P$ on $M$ is $\xi$-Lipschitz for the $W_{d,p}$ metric if for all probability
measures $\mu, \nu$ on $M$ with finite $p$th moments (cf. Section 2.2)

$$W_{d,p}(\mu P, \nu P) \le \xi\, W_{d,p}(\mu, \nu). \tag{2}$$

If $\xi < 1$, we shall also say that $P$ is $\xi$-contracting. We will prove the
following estimate:

Lemma 1. In the same setting as Theorem 1, Kac's random walk on
matrices is

$$\sqrt{1 - \frac{1}{\binom{n}{2}}}\text{-contracting}$$

in the $W_{D,2}$ metric.
The proof of Lemma 1 follows a strategy related to the path coupling
method for discrete Markov chains introduced by Bubley and Dyer [3].
Suppose $P$ is now a Markov chain on the set of vertices $V$ of a connected
graph $G$. The graph induces a natural shortest-path metric $d$ on $V$. It is
sometimes possible to prove a "local contraction" estimate of the following
form: for any $x, y \in V$ that are adjacent in $G$, there is a coupling of $X$
(distributed according to one step of $P$ from $x$) and $Y$ (distributed
according to one step of $P$ from $y$) such that

$$\mathbb{E}[d(X, Y)] \le \xi = \xi\, d(x, y) < 1.$$

If that is the case, Bubley and Dyer proved that the local couplings extend to
"globally contracting" couplings for all random pairs
$(x, y) = (X_0, Y_0) \in V^2$, with

$$\mathbb{E}[d(X, Y)] \le \xi\, \mathbb{E}[d(X_0, Y_0)].$$

This implies, in particular, that
$W_{d,1}(\mu P^t, \nu P^t) \le \mathrm{diam}(G)\, \xi^t$ for all distributions
$\mu, \nu$, where $\mathrm{diam}(G)$ is the diameter of the graph $G$. In the discrete
setting such results easily extend to total variation bounds.

Our adaptation of their technique is based on the fact that $SO(n)$ is a
geodesic space with the metric $D$: that is, $D(a, b)$ is the length of the
shortest curve connecting $a$ and $b$. We will show that whenever $(M, d)$ is a
geodesic space (or more generally a length space; see Section 2.1) and $P$ is
such that, for all deterministic $x, y \in M$ with $d(x, y) \ll 1$,

$$\mathbb{E}[d(X, Y)^p] \le (\xi + o(1))^p\, d(x, y)^p,$$

then $P$ is $\xi$-contracting and
$W_{d,p}(\mu P^t, \nu P^t) \le \mathrm{diam}(M)\, \xi^t$ for all probability
measures $\mu, \nu$ with finite $p$th moments, where $\mathrm{diam}(M)$ is the
diameter of $M$. That is, we show that if $(M, d)$ is a Polish length space and
$P$ satisfies some reasonable assumptions, one can always extend "local
contracting couplings" of $P$-steps from nearby deterministic states to
"global contracting couplings" for arbitrary initial distributions. This
result is stated as Theorem 3 below.

As with the original path-coupling methodology, proving local contraction 
is the problem-specific part of our technique. For Kac's walk, one can use 
the local geometry of SO(n) as a Riemannian manifold to do calculations in 
the tangent space, which greatly simplifies our proof. The same idea can be 
applied to two related random walks (discussed in Section 5): 

• a variant of Kac's walk where $\theta_t$ is nonuniform;

• a random walk on the set $U(n)$ of $n \times n$ unitary matrices where each step
consists of applying a unitary transformation from $U(2)$ to the span of a
pair of coordinate vectors.
Pak and Sidenko [12] use a related coupling construction, but they neither
use the local structure of $SO(n)$ as effectively nor state any general
result on local-to-global couplings. Diaconis and Saloff-Coste [5] use the
analytic technique known as the comparison method, which seems intrinsically
suboptimal for this problem, as well as more difficult to apply. [These two
papers also handle some variants of Kac's process which do not seem to be
related to the case we consider in Section 5.]

The general idea of contracting Markov chains with continuous state 
spaces has appeared in other works. Particularly relevant is a preprint of 
Ollivier [11], released during the preparation of the present paper, that con- 
tains a result related to (but a bit weaker than) our "path coupling" result, 
Theorem 3. That paper is devoted to the study of "positive Ricci curvature"
for Markov chains on metric spaces, which is precisely what we call
$\xi$-contractivity for $W_{d,1}$; from that one can deduce many properties, such
as concentration for the stationary distribution and some log-Sobolev-like 
inequalities. See [11] for details and other references where contraction prop- 
erties of Markov chains have been used recently. There have been many other 
recent results involving analytic, geometric and probabilistic applications of 
transportation cost [9, 13, 14]; this suggests that our techniques may find ap- 
plications in that growing field. Of course, we also hope that our techniques 
will be applied to obtain mixing bounds of other Markov chains of intrinsic 
interest, not necessarily related to such geometric and analytic phenomena. 

The remainder of the paper is as follows. Section 2 reviews some im- 
portant concepts from probability, metric geometry and optimal transport. 
Section 3 proves our general result on local-to-global couplings, Theorem 3. 
Section 4 contains the definition of Kac's random walk on matrices and the 
proofs of Lemma 1 and Theorem 1 . Section 5 sketches the two other random 
walks described above. Mixing time lower bounds are discussed in Section 6. 
Finally, Section 7 discusses other applications of our method and presents 
an open problem. 

2. Preliminaries. 

2.1. Metric spaces, length spaces, $\sigma$-fields. Whenever we discuss metric
spaces $(M, d)$, saying that $A \subset M$ is measurable will mean that $A$ belongs
to the $\sigma$-field generated by open sets in $M$, that is, the Borel $\sigma$-field
$\mathcal{B}(M)$. Moreover, all measures on metric spaces will be implicitly
defined over Borel sets. We will always assume that the metric spaces under
consideration are Polish, that is, complete and separable.

Let $\gamma : [a, b] \to M$ be a continuous curve. The length $L_d(\gamma)$ of
$\gamma$ (according to the metric $d$) is the following supremum:

$$L_d(\gamma) = \sup\left\{\sum_{i=1}^{n} d(\gamma(t_{i-1}), \gamma(t_i)) : n \in \mathbb{N},\ a = t_0 < t_1 < t_2 < \dots < t_n = b\right\}.$$

The curve $\gamma$ is rectifiable if $L_d(\gamma) < +\infty$. The metric space
$(M, d)$ is a length space if for all $x, y \in M$

$$d(x, y) = \inf\{L_d(\gamma) : \gamma : [0, 1] \to M \text{ continuous},\ \gamma(0) = x,\ \gamma(1) = y\}.$$

All complete Riemannian manifolds and their Gromov-Hausdorff limits
are length spaces. Nonlocally-compact examples of Polish length spaces
include separable Hilbert spaces, as well as infinite-dimensional $L^1$ spaces.

2.2. Distributions, couplings and mass transportation. All facts stated 
below can be found in [15], Chapter 6. 

Let $(M, d)$ be a metric space and $\Pr(M)$ be the space of probability
measures on (the Borel $\sigma$-field of) $M$. Given $\mu, \nu \in \Pr(M)$, a measure
$\eta \in \Pr(M \times M)$ (with the product Borel $\sigma$-field) is a coupling of
$(\mu, \nu)$ if for all Borel-measurable $A \subset M$,

$$\eta(A \times M) = \mu(A), \qquad \eta(M \times A) = \nu(A).$$

The set of couplings of $(\mu, \nu)$ is denoted by $\mathrm{Cp}(\mu, \nu)$. This is
always a nonempty set since the product measure $\mu \times \nu$ is in it.

Given $p \ge 1$, $\Pr_{d,p}(M) \subset \Pr(M)$ is the set of all probability measures
with finite $p$th moments, that is, such that for some (and hence all) $o \in M$,

$$\int_M d(o, x)^p\, d\mu(x) < +\infty.$$

One can define the $L^p$ transportation cost (or $L^p$ Wasserstein) metric
$W_{d,p}$ on $\Pr_{d,p}(M)$ by the formula

$$W_{d,p}(\mu, \nu) = \inf\left\{\left(\int_{M \times M} d(x, y)^p\, d\eta(x, y)\right)^{1/p} : \eta \in \mathrm{Cp}(\mu, \nu)\right\}, \qquad \mu, \nu \in \Pr_{d,p}(M). \tag{3}$$

Such metrics are related to the "mass transportation problem" where one
attempts to minimize the average distance traveled by grains of sand when
a sandpile is moved from one configuration to another.

It is known that $(\Pr_{d,p}(M), W_{d,p})$ is Polish iff $(M, d)$ is Polish. If
$(M, d)$ is Polish, the infimum above is always achieved by some
$\eta = \eta_{\mathrm{opt}}(\mu, \nu)$, which we will refer to as an $L^p$-optimal
coupling of $\mu$ and $\nu$.

For $x \in M$, $\delta_x \in \Pr(M)$ is the Dirac delta (or point mass) at $x$, the
distribution that assigns measure 1 to the set $\{x\}$. A basic property of mass
transportation is that if $x, y \in M$, then

$$W_{d,p}(\delta_x, \delta_y) = d(x, y).$$

If $\mu \in \Pr_{d,p}(M)$ and $\delta_x$ is as above,

$$W_{d,p}(\delta_x, \mu)^p = \int_M d(x, y)^p\, d\mu(y).$$
It is often convenient to deal with random variables rather than measures.
If $X$ is an $M$-valued random variable,

$$\mathcal{L}_X \in \Pr(M)$$

is the distribution (or law) of $X$. Notice that

$$\mathcal{L}_X \in \Pr_{d,p}(M) \iff \mathbb{E}[d(o, X)^p] < +\infty \text{ for some (all) } o \in M.$$

We will write

$$X =_{\mathcal{L}} \mu$$

whenever $X$ is a random variable with $\mathcal{L}_X = \mu$. Call a random pair
$(X, Y)$ a coupling of $(\mu, \nu)$ if $\mathcal{L}_{(X,Y)} \in \mathrm{Cp}(\mu, \nu)$.
$W_{d,p}(\mu, \nu)$ can be equivalently viewed as the infimum of
$\mathbb{E}[d(X, Y)^p]^{1/p}$ over all such couplings.

Finally, we note that if $M$ is compact (as it is in our main application),
then for any $p \ge 1$, $\Pr_{d,p}(M) = \Pr(M)$ and $W_{d,p}$ metrizes weak
convergence.
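
To make the definition of the transportation cost concrete, here is a minimal brute-force sketch (not part of the paper). It assumes both measures are uniform on the same number of atoms; in that special case it is a standard fact that the infimum over couplings is attained at a matching of atoms, so one can simply enumerate permutations. The function name `wasserstein_uniform` and the sample points are hypothetical, and the approach is only feasible for very small inputs.

```python
import itertools

def wasserstein_uniform(xs, ys, dist, p=2):
    """L^p transportation cost between the uniform measure on xs and the
    uniform measure on ys (equal sizes), by brute force over matchings."""
    n = len(xs)
    if len(ys) != n:
        raise ValueError("this sketch assumes equally many atoms")
    # Cheapest total cost sum_k dist(x_k, y_perm(k))^p over all matchings.
    best = min(
        sum(dist(xs[k], ys[perm[k]]) ** p for k in range(n))
        for perm in itertools.permutations(range(n))
    )
    return (best / n) ** (1.0 / p)

def d(a, b):
    """Metric on the real line, used as the ground metric in this example."""
    return abs(a - b)

# Two point masses: the cost is the distance between the points.
w_point = wasserstein_uniform([0.3], [1.5], d, p=2)

# Three atoms, each shifted by 0.5: the identity matching is optimal.
w_shift = wasserstein_uniform([0.0, 1.0, 2.0], [0.5, 1.5, 2.5], d, p=2)
```

The first call reproduces the identity $W_{d,p}(\delta_x, \delta_y) = d(x, y)$ for one pair of points.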

2.3. Markov transition kernels. In this section we assume $(M, d)$ is
Polish. A Markov transition kernel on $M$ is a map

$$P : M \times \mathcal{B}(M) \to [0, 1]$$

such that, for all $x \in M$, $P_x(\cdot) = P(x, \cdot)$ is a probability measure and
for all $A \in \mathcal{B}(M)$, $P_x(A)$ is a measurable function of $x$. A Markov
transition kernel defines an $M$-valued Markov chain: for each
$\mu \in \Pr(M)$, there exists a unique distribution on sequences of $M$-valued
random variables

$$\{X(t)\}_{t=0}^{+\infty}$$

such that $X(0) =_{\mathcal{L}} \mu$ and for all $t \in \{1, 2, 3, \dots\}$, the
distribution of $X(t)$ conditioned on $\{X(s)\}_{s=0}^{t-1}$ is $P_{X(t-1)}$.

For $\mu \in \Pr(M)$ and $t \in \mathbb{N}$, $\mu P^t$ is the law of $X(t)$ defined as
above; one can check that $\mu P^{t+1} = (\mu P^t)P$ for all $t \ge 0$.

3. From local to global couplings. In this section we will discuss our
method for moving from local to global bounds for the Lipschitz properties
of Markov kernels. In our application we have a Markov kernel $P$ on a
Polish space $(M, d)$. Using explicit couplings, we will show that, for some
$C > 0$ and all $x, y \in M$,

$$W_{d,p}(P_x, P_y) \le (C + o(1))\, d(x, y),$$

where $o(1) \to 0$ when $y \to x$. The main result in this section implies that,
under some natural conditions, $W_{d,p}(\mu P, \nu P) \le Cr$ whenever
$\mu, \nu \in \Pr_{d,p}(M)$ are $r$-close.
We first state a definition. 






Definition 1. A map $f : M \to N$ between metric spaces $(M, d)$ and
$(N, d')$ is said to be locally $C$-Lipschitz (for some $C > 0$) if for all
$x \in M$

$$\limsup_{y \to x} \frac{d'(f(x), f(y))}{d(x, y)} \le C.$$

Theorem 3 (Local-to-global coupling). Suppose $(M, d)$ is a Polish length
space, $p \ge 1$ is given and $P$ is a Markov transition kernel on $(M, d)$
satisfying the following characteristics:

1. $P_x$ has finite $p$th moments for all $x$: that is, $P_x \in \Pr_{d,p}(M)$ for
all $x \in M$;

2. $P$ is locally $C$-Lipschitz on $M$. That is, the map

$$x \mapsto P_x$$

from $(M, d)$ to $(\Pr_{d,p}(M), W_{d,p})$ is locally $C$-Lipschitz.

Then for all $\mu \in \Pr_{d,p}(M)$, we also have $\mu P \in \Pr_{d,p}(M)$ and, moreover,
the map $\mu \mapsto \mu P$ is $C$-Lipschitz, that is,

$$\forall \mu, \nu \in \Pr_{d,p}(M), \qquad W_{d,p}(\mu P, \nu P) \le C\, W_{d,p}(\mu, \nu).$$

Before we prove this result, we discuss its application to the setting where
$C = (1 - \kappa)$ for some $\kappa > 0$, the diameter $\mathrm{diam}_d(M)$ of $(M, d)$ is
bounded (Ollivier [11] noted that, for $C = (1 - \kappa) < 1$,
$\mathrm{diam}_d(M) \le 2A/\kappa$, where $A = \sup_{x \in M} W_{d,p}(\delta_x, P_x)$. Hence,
the assumption that $\mathrm{diam}_d(M) < +\infty$ is equivalent to $A < +\infty$) and
the other assumptions of Theorem 3 are satisfied. In this case
$\Pr_{d,p}(M) = \Pr(M)$, that is, all probability measures have bounded $p$th
moments. Moreover, Banach's fixed point theorem states that a
$(1 - \kappa)$-Lipschitz map from a complete metric space to itself has a unique
fixed point. Since $(\Pr(M), W_{d,p})$ is Polish and $\mu \mapsto \mu P$ is a
$(1 - \kappa)$-Lipschitz map from this space to itself, there exists a unique
element $\mu_* \in \Pr(M)$ with

$$\mu_* P = \mu_*.$$

It follows that $\mu_*$ is the unique $P$-invariant distribution on $M$. Moreover,
for all $t \in \mathbb{N}$ and $\mu \in \Pr(M)$,

$$W_{d,p}(\mu P^t, \mu_*) = W_{d,p}(\mu P^t, \mu_* P^t) \le (1 - \kappa)^t\, W_{d,p}(\mu, \mu_*) \le \mathrm{diam}_d(M)\, e^{-\kappa t}.$$

We collect those facts in the following corollary.

Corollary 1. Assume $(M, d)$ and $P$ satisfy the assumptions of Theorem 3
for some $p \ge 1$ and $C = (1 - \kappa) < 1$ (i.e., $\kappa > 0$). Assume, moreover,
that the diameter $\mathrm{diam}_d(M)$ of $(M, d)$ is finite. Then there exists a
unique $P$-invariant measure $\mu_*$ on $M$. Moreover, the $L^p$ transportation
cost mixing times

$$\tau_{d,p}(\varepsilon) = \min\{t \in \mathbb{N} : \forall \mu \in \Pr(M),\ W_{d,p}(\mu P^t, \mu_*) \le \varepsilon\}$$

satisfy

$$\tau_{d,p}(\varepsilon) \le \left\lceil \kappa^{-1} \ln\left(\frac{\mathrm{diam}_d(M)}{\varepsilon}\right) \right\rceil.$$
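
The bound of the corollary is easy to evaluate numerically. The sketch below (not from the paper) computes the smallest integer $t$ with $\mathrm{diam}\cdot e^{-\kappa t} \le \varepsilon$ and plugs in parameters for Kac's walk. It assumes the contraction coefficient $\sqrt{1 - 1/\binom{n}{2}}$ of Lemma 1 and uses $\pi\sqrt{n}$, the quantity inside the logarithm of Theorem 1, as the diameter bound; the function names and the sample values $n = 10$, $\varepsilon = 0.1$ are hypothetical.

```python
import math

def mixing_time_bound(kappa, diam, eps):
    """Smallest integer t >= 0 with diam * exp(-kappa * t) <= eps."""
    return max(0, math.ceil(math.log(diam / eps) / kappa))

def kac_parameters(n):
    """Contraction rate and diameter bound assumed for Kac's walk on SO(n)."""
    pairs = n * (n - 1) // 2                     # number of coordinate pairs
    kappa = 1.0 - math.sqrt(1.0 - 1.0 / pairs)   # C = 1 - kappa, as in Lemma 1
    diam = math.pi * math.sqrt(n)                # assumption: diameter bound
    return kappa, diam

n, eps = 10, 0.1
kappa, diam = kac_parameters(n)
t_corollary = mixing_time_bound(kappa, diam, eps)

# The simpler expression of Theorem 1, which uses 1/kappa <= n^2.
t_theorem = math.ceil(n ** 2 * math.log(diam / eps))
```

Since $\kappa \ge 1/n^2$ under these assumptions, `t_corollary` never exceeds `t_theorem`.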



We now proceed to prove the theorem. 

Proof of Theorem 3. The first step of the proof is a simple lemma
(proven subsequently) about locally Lipschitz functions.

Lemma 2. With the notation of Definition 1, assume that $M$ is a length
space. Then any $f : M \to N$ that is locally $C$-Lipschitz is $C$-Lipschitz
according to the standard definition.

For our proof we only need the following direct consequence [let
$(N, d') = (\Pr_{d,p}(M), W_{d,p})$, $f(x) = P_x$].

Corollary 2. If $P$ is a Markov transition kernel on a length space
$(M, d)$ satisfying condition 2 of Theorem 3, then
$W_{d,p}(P_x, P_y) \le C\, d(x, y)$ for all $x, y \in M$.

The bounding of $W_{d,p}(P_x, P_y)$ can be thought of as an implicit
construction of a coupling along a geodesic path; this is precisely where the
name "path coupling" comes from.

The second lemma we need (proven in Section 3.2) shows that
$\mu P \in \Pr_{d,p}(M)$ whenever $\mu \in \Pr_{d,p}(M)$ and shows that we will only
need to compare $\mu P$ and $\nu P$ for $\mu, \nu$ with countable support.

Lemma 3. Let $(M, d)$ be Polish. Suppose $P$ is a Markov transition kernel
on $M$ such that:

1. $P_x \in \Pr_{d,p}(M)$ for all $x \in M$;

2. $x \mapsto P_x$ is a $C$-Lipschitz map from $M$ to $\Pr_{d,p}(M)$.

Then for all $\mu \in \Pr_{d,p}(M)$ we have $\mu P \in \Pr_{d,p}(M)$. Moreover, there
exists a sequence $\{\mu_j\}_j \subset \Pr_{d,p}(M)$ of measures with countable
support such that $W_{d,p}(\mu_j, \mu) \to 0$ and $W_{d,p}(\mu_j P, \mu P) \to 0$.

The lemma implies the following statement: if
$W_{d,p}(\mu P, \nu P) \le C\, W_{d,p}(\mu, \nu)$ for all $\mu, \nu$ in $\Pr_{d,p}(M)$ that
have countable support, then the same holds for all $\mu, \nu$ in $\Pr_{d,p}(M)$.
Our final goal is to prove the Lipschitz estimate for measures with countable
support.
Thus, let $\mu = \sum_{j \in \mathbb{N}} p_j \delta_{x_j}$ be a convex combination of a
countable number of point masses ($x_j \in M$ for all $j$); similarly, let
$\nu = \sum_{k \in \mathbb{N}} q_k \delta_{y_k}$. The $L^p$-optimal coupling $\eta$ of $\mu$
and $\nu$ is of the form

$$\eta = \sum_{j,k \in \mathbb{N}} r_{j,k}\, \delta_{(x_j, y_k)}$$

for some convex weights $r_{j,k}$. Now define for each pair $j, k$ an
$L^p$-optimal coupling $Q_{j,k}$ of $P_{x_j}, P_{y_k}$. Then

$$\eta' = \sum_{j,k \in \mathbb{N}} r_{j,k}\, Q_{j,k} \in \mathrm{Cp}(\mu P, \nu P).$$

Moreover, since $x \mapsto P_x$ is $C$-Lipschitz,

$$\int_{M \times M} d(u, v)^p\, dQ_{j,k}(u, v) = W_{d,p}(P_{x_j}, P_{y_k})^p \le C^p\, d(x_j, y_k)^p,$$

which implies

$$W_{d,p}(\mu P, \nu P)^p \le \int_{M \times M} d(u, v)^p\, d\eta'(u, v)$$

$$= \sum_{j,k \in \mathbb{N}} r_{j,k} \int_{M \times M} d(u, v)^p\, dQ_{j,k}(u, v)$$

$$\le C^p \sum_{j,k \in \mathbb{N}} r_{j,k}\, d(x_j, y_k)^p$$

$$= C^p \int_{M \times M} d(u, v)^p\, d\eta(u, v).$$

The RHS is simply $C^p\, W_{d,p}(\mu, \nu)^p$. □

Remark 1. Ollivier presents a similar result for $p = 1$ in [11],
Proposition 17. His proof relies on a quite nontrivial fact (proven in, e.g.,
[15]): the existence of a Markov transition kernel $Q$ on $M^2$ such that, for
all $(x, y) \in M^2$, $Q_{(x,y)}$ is an $L^1$-optimal coupling of $(P_x, P_y)$. Our
argument provides an alternative approach, which is perhaps simpler, to the
same result. Moreover, his proposition implies our theorem only when $P$
satisfies

$$\limsup_{r \searrow 0}\ \sup_{x, y \in M :\, d(x, y) \le r} \frac{W_{d,1}(P_x, P_y)}{r} \le C,$$

which is a stronger requirement than our local Lipschitz condition.
3.1. Proof of Lemma 2. It suffices to show that, for all $x, y \in M$, any
continuous curve $\gamma : [0, 1] \to M$ connecting $\gamma(0) = x$ to $\gamma(1) = y$
and any number $C' > C$,

$$d'(f(x), f(y)) \le C' L_d(\gamma).$$

To prove this, assume without loss of generality that $L_d(\gamma) < +\infty$. For
$0 \le t_1 \le t_2 \le 1$, define the length function

$$\ell(t_1, t_2) = L_d\big(\gamma|_{[\min\{t_1, t_2\}, \max\{t_1, t_2\}]}\big).$$

It is an exercise to show that $\ell$ is a continuous function,
$\ell(t_1, t_2) \ge d(\gamma(t_1), \gamma(t_2))$ and

$$\forall\, 0 \le t_1 \le t_2 \le t_3 \le 1, \qquad \ell(t_1, t_2) + d(\gamma(t_2), \gamma(t_3)) \le \ell(t_1, t_3). \tag{4}$$

For each $t \in [0, 1]$, we have

$$\limsup_{s \to t} \frac{d'(f(\gamma(s)), f(\gamma(t)))}{d(\gamma(s), \gamma(t))} \le \limsup_{z \to \gamma(t)} \frac{d'(f(z), f(\gamma(t)))}{d(z, \gamma(t))} \le C$$

by the local Lipschitz assumption. Since $C' > C$, one can find, for any
$t \in [0, 1)$, some $\delta_t \in (0, 1 - t)$ such that for all $s \in (t, t + \delta_t]$,
$d'(f(\gamma(s)), f(\gamma(t))) \le C' d(\gamma(t), \gamma(s)) \le C' \ell(t, s)$.
Now set

$$T = \sup\{t \in [0, 1] : d'(f(\gamma(0)), f(\gamma(t))) \le C' \ell(0, t)\}.$$

Notice that

$$d'(f(\gamma(0)), f(\gamma(T))) \le C' \ell(0, T) \tag{5}$$

by continuity. We claim that $T = 1$. To see this, suppose $T < 1$ and set
$\delta = \delta_T$. Then

$$d'(f(\gamma(0)), f(\gamma(T + \delta))) \le d'(f(\gamma(0)), f(\gamma(T))) + d'(f(\gamma(T)), f(\gamma(T + \delta)))$$

$$\text{[use (5) and defn. of } \delta_T\text{]} \quad \le C' \ell(0, T) + C' d(\gamma(T), \gamma(T + \delta))$$

$$\text{[use (4)]} \quad \le C' \ell(0, T + \delta),$$

which contradicts the fact that $T$ is the supremum of the corresponding set.
We deduce that $T = 1$ and

$$d'(f(x), f(y)) = d'(f(\gamma(0)), f(\gamma(1))) \le C' \ell(0, 1) = C' L_d(\gamma),$$

as desired.
3.2. Proof of Lemma 3. For the first statement we note that, for a given
reference point $y \in M$,

$$A^p = \int_M d(y, z)^p\, dP_y(z) < +\infty.$$

Now for any $x \in M$, let $(X, Y)$ be an $L^p$-optimal coupling of $(P_x, P_y)$. Then

$$\|d(y, X)\|_{L^p} \le \|d(y, Y)\|_{L^p} + \|d(X, Y)\|_{L^p} = A + W_{d,p}(P_x, P_y) \le A + C\, d(x, y),$$

which is the same as

$$\int_M d(y, v)^p\, dP_x(v) \le (A + C\, d(x, y))^p.$$

Hence, if $\mu \in \Pr_{d,p}(M)$,

$$\int_M d(y, v)^p\, d\mu P(v) = \int_M \left(\int_M d(y, v)^p\, dP_x(v)\right) d\mu(x)$$

$$\le \int_M [A + C\, d(x, y)]^p\, d\mu(x)$$

$$\text{[use } |a + b|^p \le 2^p(|a|^p + |b|^p)\text{]} \quad \le (2C)^p \int_M d(x, y)^p\, d\mu(x) + (2A)^p$$

$$\text{[}\mu \in \Pr_{d,p}(M)\text{]} \quad < +\infty.$$

Thus, $\mu P$ is in $\Pr_{d,p}(M)$ whenever $\mu$ is.

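We now present a discrete approximation scheme for $\mu$ and $\mu P$. Since $M$
is separable, there exists a sequence of partitions
$\{\mathcal{P}_j\}_{j \in \mathbb{N}}$ of $M$ such that:

• each partition contains countably many measurable sets;

• for all $j \in \mathbb{N}$, $\mathcal{P}_{j+1}$ refines $\mathcal{P}_j$; and

• for all $j \in \mathbb{N}$, the sets in $\mathcal{P}_j$ have diameter at most
$\varepsilon_j$ for some sequence $\varepsilon_j \to 0$.

Let us also assume that for each $j \in \mathbb{N}$ and $A \in \mathcal{P}_j$ we have
picked some $x_A^{(j)} \in A$. Consider the measures

$$\mu_j = \sum_{A \in \mathcal{P}_j} \mu(A)\, \delta_{x_A^{(j)}}. \tag{6}$$

Clearly, $\mu_j \in \Pr_{d,p}(M)$ for all $j$ and $W_{d,p}(\mu_j, \mu) \to 0$ when
$j \to +\infty$. Our goal will be to show that $W_{d,p}(\mu_j P, \mu P) \to 0$. First
recall that $x \mapsto P_x$ is $C$-Lipschitz, hence, if $x, y \in M$ and
$d(x, y) \le \varepsilon_j$, $W_{d,p}(P_x, P_y) \le C\varepsilon_j$. In particular, for all
$j \in \mathbb{N}$, all $A \in \mathcal{P}_j$ and all $x \in A$,

$$W_{d,p}\big(P_{x_A^{(j)}}, P_x\big) \le C\varepsilon_j.$$

We will use this to show that

$$\forall j \le k, \qquad W_{d,p}(\mu_j P, \mu_k P) \le C\varepsilon_j \tag{7}$$

(in particular, $\{\mu_j P\}_j$ is Cauchy).

Recall that if $j \le k$, $\mathcal{P}_k$ is a refinement of $\mathcal{P}_j$, hence, for
all $B \in \mathcal{P}_k$ there exists a set $A_B \in \mathcal{P}_j$ with $B \subset A_B$.
For each such $B$, we have $x_B^{(k)} \in A_B$, which has diameter $\le \varepsilon_j$,
hence $d\big(x_B^{(k)}, x_{A_B}^{(j)}\big) \le \varepsilon_j$ and there exists a coupling
$\eta_{B,k,j}$ of $P_{x_B^{(k)}}$ and $P_{x_{A_B}^{(j)}}$ with

$$\int_{M \times M} d(u, v)^p\, d\eta_{B,k,j}(u, v) \le (C\varepsilon_j)^p.$$

Extend this to a coupling of $\mu_k P$ and $\mu_j P$ by

$$\eta_{k,j} = \sum_{B \in \mathcal{P}_k} \mu(B)\, \eta_{B,k,j}.$$

To prove that $\eta_{k,j} \in \mathrm{Cp}(\mu_k P, \mu_j P)$, notice that the first
marginal of this measure is

$$\sum_{B \in \mathcal{P}_k} \mu(B)\, P_{x_B^{(k)}} = \mu_k P.$$

Moreover, for any $A \in \mathcal{P}_j$, the set of all $B \in \mathcal{P}_k$ with
$A_B = A$ is a partition of $A$, hence the second marginal is also right:

$$\sum_{B \in \mathcal{P}_k} \mu(B)\, P_{x_{A_B}^{(j)}} = \sum_{A \in \mathcal{P}_j} \Big(\sum_{B \in \mathcal{P}_k :\, A_B = A} \mu(B)\Big) P_{x_A^{(j)}} = \sum_{A \in \mathcal{P}_j} \mu(A)\, P_{x_A^{(j)}} = \mu_j P.$$

It follows that $\eta_{k,j} \in \mathrm{Cp}(\mu_k P, \mu_j P)$ and, moreover, one can
check that

$$\int_{M \times M} d(u, v)^p\, d\eta_{k,j}(u, v) \le (C\varepsilon_j)^p,$$

which implies (7).

$(\Pr_{d,p}(M), W_{d,p})$ is Polish since $(M, d)$ is. By the above, we know that
there exists a measure $\alpha \in \Pr_{d,p}(M)$ such that
$W_{d,p}(\mu_j P, \alpha) \to 0$. This also implies ([15], Theorem 6.8) that
$\mu_j P \Rightarrow \alpha$ in the weak topology. However, it is an exercise to show
that $\mu_j P \Rightarrow \mu P$ weakly, hence $\alpha = \mu P$ and
$W_{d,p}(\mu_j P, \mu P) \to 0$, as desired.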
4. Analysis of Kac's random walk.

4.1. Definitions. Let $M(n, \mathbb{R})$ be the set of all $n \times n$ matrices with
real-valued entries. These are the linear operators from $\mathbb{R}^n$ to itself
and we equip $\mathbb{R}^n$ with a canonical basis $e_1, \dots, e_n$ of orthonormal
vectors. For $a \in M(n, \mathbb{R})$, $a^\dagger$ is the transpose of $a$ in the basis
$e_1, \dots, e_n$. Using it, one can define the Hilbert-Schmidt inner product
$\langle a, b \rangle_{\mathrm{hs}} = \mathrm{Tr}(a^\dagger b)$ on $M(n, \mathbb{R})$, under which
it is isomorphic to $\mathbb{R}^{n^2}$ with the standard Euclidean inner product.
We let $\|\cdot\|_{\mathrm{hs}}$ be the corresponding norm.

An element $a \in M(n, \mathbb{R})$ is orthogonal if $aa^\dagger = \mathrm{id}$, the
identity matrix. The subset of $M(n, \mathbb{R})$ given by

$$SO(n) = \{a \in M(n, \mathbb{R}) : aa^\dagger = \mathrm{id},\ \det(a) = 1\}$$

is a smooth, compact, connected submanifold of $M(n, \mathbb{R})$. It is also a Lie
group since it is closed under matrix multiplication and matrix inverse.
Therefore, $SO(n)$ has a Haar measure $\mathcal{H}$, which we may define as the
unique probability measure on that group such that, for all measurable
$S \subset SO(n)$ and $a \in SO(n)$, we have
$\mathcal{H}(S) = \mathcal{H}(Sa) = \mathcal{H}(aS)$, where $Sa = \{sa : s \in S\}$ and
$aS = \{as : s \in S\}$.

We now define Kac's random walk on $SO(n)$. For $1 \le i < j \le n$ and
$\theta \in [0, 2\pi]$ define $R(i, j, \theta)$ as a rotation by $\theta$ of the plane
generated by $e_i, e_j$. This is equivalent to setting

$$R(i, j, \theta)\, e_k = \begin{cases} \cos\theta\, e_i + \sin\theta\, e_j, & k = i, \\ \cos\theta\, e_j - \sin\theta\, e_i, & k = j, \\ e_k, & k \in \{1, \dots, n\} \setminus \{i, j\}, \end{cases} \tag{8}$$

and extending $R(i, j, \theta)$ to all of $\mathbb{R}^n$ by linearity. Kac's random walk
on matrices corresponds to the following Markov transition kernel:

$$K_x(S) = \frac{1}{\binom{n}{2}} \sum_{1 \le i < j \le n} \int_0^{2\pi} \delta_{R(i,j,\theta)x}(S)\, \frac{d\theta}{2\pi}$$

[$x \in SO(n)$, $S \subset SO(n)$ measurable].

Thus, to generate $X =_{\mathcal{L}} K_x$ from $x$, one chooses $1 \le i < j \le n$
uniformly at random from all $\binom{n}{2}$ possible choices, then picks
$\theta \in [0, 2\pi]$ also uniformly at random and then sets $X = R(i, j, \theta)x$.
The required measurability conditions are easily established. One can also
check that the Haar measure $\mathcal{H}$ is $K$-invariant.
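
The sampling recipe just described can be sketched as follows (an illustration, not the paper's code). The sketch assumes the convention $R e_i = \cos\theta\, e_i + \sin\theta\, e_j$ and $R e_j = \cos\theta\, e_j - \sin\theta\, e_i$, so that left multiplication by $R(i, j, \theta)$ mixes only rows $i$ and $j$ of $x$. Matrices are plain lists of rows with 0-based indices, and the names `kac_matrix_step` and `orthogonality_defect` are hypothetical.

```python
import math
import random

def kac_matrix_step(x, rng):
    """Return R(i, j, theta) x for a uniform pair i < j and a uniform angle."""
    n = len(x)
    i, j = sorted(rng.sample(range(n), 2))   # uniform unordered pair
    theta = rng.uniform(0.0, 2.0 * math.pi)
    c, s = math.cos(theta), math.sin(theta)
    y = [row[:] for row in x]                # copy; only rows i and j change
    for col in range(n):
        y[i][col] = c * x[i][col] - s * x[j][col]
        y[j][col] = s * x[i][col] + c * x[j][col]
    return y

def orthogonality_defect(x):
    """Largest entry of |x^T x - id|; zero for an exactly orthogonal matrix."""
    n = len(x)
    return max(
        abs(sum(x[k][a] * x[k][b] for k in range(n)) - (1.0 if a == b else 0.0))
        for a in range(n) for b in range(n)
    )

rng = random.Random(1)                       # hypothetical seed
n = 4
x = [[1.0 if a == b else 0.0 for b in range(n)] for a in range(n)]  # X(0) = id
for _ in range(200):
    x = kac_matrix_step(x, rng)
defect = orthogonality_defect(x)             # stays at rounding-error level
```

The copy of `x` is made only for clarity; an in-place variant would need to store just the two affected rows.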

4.2. The geometry of $SO(n)$. We collect some standard facts that will
be used in our proofs.

The tangent space at the identity matrix $\mathrm{id}$ is the set of all
anti-self-adjoint operators

$$T = T_{\mathrm{id}} SO(n) = \{h \in M(n, \mathbb{R}) : h^\dagger = -h\}. \tag{9}$$

We let $D$ be the Riemannian metric on $SO(n)$ induced by
$\langle \cdot, \cdot \rangle_{\mathrm{hs}}$. Since $SO(n)$ is compact, one can show the
following:

$$\forall z, w \in SO(n), \qquad \|z - w\|_{\mathrm{hs}} \le D(z, w) \le \|z - w\|_{\mathrm{hs}} + O(\|z - w\|_{\mathrm{hs}}^2), \tag{10}$$

where $O(r^a)$ is just some term whose absolute value is uniformly bounded
by $c|r|^a$ and $c > 0$ is a constant not depending on $|r|$ (we will use this
notation from now on). Moreover, if we let $\Pi_T$ be the orthogonal projector
onto $T$ (according to the Hilbert-Schmidt inner product), then (although we
will not use this fact, one can check that $\Pi_T\, \mathrm{id} = 0$)

$$\forall z \in SO(n), \qquad \|z - \mathrm{id} - \Pi_T(z - \mathrm{id})\|_{\mathrm{hs}} = O(D(z, \mathrm{id})^2). \tag{11}$$

This is so because if $\|z - \mathrm{id}\|_{\mathrm{hs}} = r \ll 1$, then
$\|z - \mathrm{id} - h\|_{\mathrm{hs}} = O(r^2)$ for some $h \in T$, and
$h = \Pi_T(z - \mathrm{id})$ is the best choice of approximation one may make.
Notice that the two equations together imply

$$\big|D(z, \mathrm{id}) - \|\Pi_T(z - \mathrm{id})\|_{\mathrm{hs}}\big| = O(\|z - \mathrm{id}\|_{\mathrm{hs}}^2). \tag{12}$$

We notice that these distances are all invariant under multiplication: if
$a, b, c \in SO(n)$,

$$D(ca, cb) = D(ac, bc) = D(a, b)$$

and similarly for $\mathrm{hs}(a, b) = \|a - b\|_{\mathrm{hs}}$.

4.3. The contraction coefficient. In this section we prove Lemma 1. 

Proof of Lemma 1. Consider x,y £ SO(n) and let D(x,y) = r. Our 
main task is to show that there exists a coupling (X, Y) of (K x ,K y ) with 

E[D(^y) 2 ]<(l--^y + 0(r 3 ), 

where, as in the previous section, 0(r 3 ) is some term that is uniformly 
bounded by a multiple of |r| 3 . The existence of such a coupling implies that 

W D , 2 (K X , K y ) < Ji--Ld(x, y) + 0(D(x, y) 2 ), 

which shows that K is locally ^Jl — l/^-Lipschitz for p = 2. 

Our coupling will be as follows. Suppose we set $X = R(i,j,\theta)x$ with $i, j, \theta$
randomly picked as prescribed in the definition of the random walk. We
will set $Y = R(i,j,\theta')y$ with the same $i, j$ and some $\theta' = (\theta - \alpha) \bmod 2\pi$,
where $\alpha = \alpha(i,j,x,y)$ depends on $i, j, x, y$ but not on $\theta$. In that case $\theta'$ is
uniform on $[0,2\pi]$ independently of $x, y$, hence, $(X,Y)$ is a valid coupling
of $(K_x,K_y)$. Also notice that, using the invariance of $D$ under multiplication,

(13) $$D(X,Y) = D(R(i,j,\theta)x, R(i,j,\theta')y) = D(R(i,j,\theta), R(i,j,\theta')yx^{\dagger}) = D(R(i,j,\alpha), yx^{\dagger}),$$

as

$$R(i,j,\theta')^{\dagger}R(i,j,\theta) = R(i,j,\theta-\theta') = R(i,j,\alpha).$$

We will use (10), (11) and (12) to bound the RHS of (13): this will allow
us to do all calculations we need in the tangent space $T = T_{\mathrm{id}}SO(n)$. First,
however, we need an orthonormal basis for that space. For each $1 \le k < \ell \le n$,
let $a_{k\ell} \in T$ be the linear operator that is uniquely defined by

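$$a_{k\ell}\,e_t = \begin{cases} e_{\ell}/\sqrt{2}, & t = k,\\ -e_k/\sqrt{2}, & t = \ell,\\ 0, & t \in \{1,\dots,n\}\setminus\{k,\ell\}. \end{cases}$$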

One can check that $\{a_{k\ell}\}_{1\le k<\ell\le n}$ is indeed an orthonormal basis for
$T = T_{\mathrm{id}}SO(n)$ with the Hilbert-Schmidt inner product. For $1 \le t \le n$ we also
define $d_t \in M(n,\mathbb{R})$ as the matrix that has a 1 at the $(t,t)$th entry and zeroes
elsewhere. Then $\langle d_t,d_s\rangle_{\mathrm{hs}} = 1$ if $t = s$ and $0$ otherwise, and also
$\langle d_t,a_{k\ell}\rangle_{\mathrm{hs}} = 0$ for any $t, k, \ell$. With these definitions, one can write

(14) $$R(i,j,\alpha) = \mathrm{id} + (\cos\alpha - 1)\,d_i + (\cos\alpha - 1)\,d_j + \sqrt{2}\,\sin\alpha\; a_{ij}.$$

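Now set $h = \Pi_T(yx^{\dagger} - \mathrm{id})$. Since $D(yx^{\dagger},\mathrm{id}) = D(x,y) = r$, $\|h\|_{\mathrm{hs}} = r + O(r^{2})$
and $\|yx^{\dagger} - \mathrm{id} - h\|_{\mathrm{hs}} = O(r^{2})$. Suppose we commit ourselves to making a
choice of $\alpha = O(r)$ (i.e., $|\alpha| \le cr$ for a constant $c$ independent of $r$). Expanding
$\sin$ and $\cos$, we get

$$\|R(i,j,\alpha) - \mathrm{id} - \sqrt{2}\,\alpha\,a_{ij}\|_{\mathrm{hs}} = O(r^{2}).$$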
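Identity (14) and the first-order expansion above can be checked numerically. The sketch below is an informal sanity check, not part of the proof. It assumes the sign conventions written in its docstrings for $R(i,j,\alpha)$ and $a_{ij}$ (any consistent pair of conventions would do), and the helper names are ours.

```python
import math

def identity(n):
    return [[1.0 if r == c else 0.0 for c in range(n)] for r in range(n)]

def a_basis(n, k, l):
    """Antisymmetric matrix a_{kl} (0-based, k < l), assumed to send
    e_k to e_l / sqrt(2) and e_l to -e_k / sqrt(2)."""
    a = [[0.0] * n for _ in range(n)]
    a[l][k] = 1.0 / math.sqrt(2.0)
    a[k][l] = -1.0 / math.sqrt(2.0)
    return a

def rotation(n, i, j, alpha):
    """Plane rotation R(i, j, alpha), assumed to send e_i to
    cos(alpha) e_i + sin(alpha) e_j."""
    r = identity(n)
    r[i][i] = r[j][j] = math.cos(alpha)
    r[j][i] = math.sin(alpha)
    r[i][j] = -math.sin(alpha)
    return r

def rhs_of_14(n, i, j, alpha):
    """id + (cos(alpha) - 1)(d_i + d_j) + sqrt(2) sin(alpha) a_ij."""
    m = identity(n)
    m[i][i] += math.cos(alpha) - 1.0
    m[j][j] += math.cos(alpha) - 1.0
    a = a_basis(n, i, j)
    for r in range(n):
        for c in range(n):
            m[r][c] += math.sqrt(2.0) * math.sin(alpha) * a[r][c]
    return m

n, i, j, alpha = 5, 1, 3, 0.01
lhs, rhs = rotation(n, i, j, alpha), rhs_of_14(n, i, j, alpha)
max_gap = max(abs(lhs[r][c] - rhs[r][c]) for r in range(n) for c in range(n))

# Remainder of the linearization: ||R - id - sqrt(2) * alpha * a_ij||_hs.
a = a_basis(n, i, j)
remainder = math.sqrt(sum(
    (lhs[r][c] - (1.0 if r == c else 0.0) - math.sqrt(2.0) * alpha * a[r][c]) ** 2
    for r in range(n) for c in range(n)))
print(max_gap, remainder, alpha ** 2)
```

For small `alpha` the remainder is dominated by the two diagonal entries $\cos\alpha - 1$, so it is of order $\alpha^{2}$, in line with the $O(r^{2})$ term when $\alpha = O(r)$.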

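Moreover, we also have

$$\begin{aligned}
D(yx^{\dagger}, R(i,j,\alpha)) &= \|yx^{\dagger} - R(i,j,\alpha)\|_{\mathrm{hs}} + O(\|yx^{\dagger} - R(i,j,\alpha)\|_{\mathrm{hs}}^{2}) && (15)\\
&= \|yx^{\dagger} - \mathrm{id} - \sqrt{2}\,\alpha\,a_{ij}\|_{\mathrm{hs}} + O(\|yx^{\dagger} - \mathrm{id} - \sqrt{2}\,\alpha\,a_{ij}\|_{\mathrm{hs}}^{2} + r^{2}) && (16)\\
&= \|h - \sqrt{2}\,\alpha\,a_{ij}\|_{\mathrm{hs}} + O(r^{2}). && (17)
\end{aligned}$$

Thus, we choose $\alpha = \langle h,a_{ij}\rangle_{\mathrm{hs}}/\sqrt{2}$, which minimizes $\|h - \sqrt{2}\,\alpha\,a_{ij}\|_{\mathrm{hs}}^{2}$ and
only depends on $i, j$ and $h = \Pi_T(yx^{\dagger} - \mathrm{id})$. Since the $a_{k\ell}$ form an orthonormal
basis of $T \ni h$, we have

$$h = \sum_{1\le k<\ell\le n} \langle h,a_{k\ell}\rangle_{\mathrm{hs}}\,a_{k\ell} \;\Longrightarrow\; \sum_{1\le k<\ell\le n} \langle h,a_{k\ell}\rangle_{\mathrm{hs}}^{2} = \|h\|_{\mathrm{hs}}^{2} = r^{2} + O(r^{3}).$$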

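This shows that $|\alpha| = O(r)$ as desired and, moreover,

$$\begin{aligned}
D(X,Y)^{2} &= D(yx^{\dagger},R(i,j,\alpha))^{2} && \text{[by (13)]}\\
&= \|h - \langle h,a_{ij}\rangle_{\mathrm{hs}}\,a_{ij}\|_{\mathrm{hs}}^{2} + O(r^{3})\\
&= \|h\|_{\mathrm{hs}}^{2} - \langle h,a_{ij}\rangle_{\mathrm{hs}}^{2} + O(r^{3}) && \text{(expand $h$)}.
\end{aligned}$$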
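If we now average over $i, j, \theta$, we obtain

$$\begin{aligned}
\mathbb{E}[D(X,Y)^{2}] &= \|h\|_{\mathrm{hs}}^{2} - \frac{1}{\binom{n}{2}} \sum_{1\le i<j\le n} \langle h,a_{ij}\rangle_{\mathrm{hs}}^{2} + O(r^{3})\\
&= \left(1 - \frac{1}{\binom{n}{2}}\right)\|h\|_{\mathrm{hs}}^{2} + O(r^{3})\\
&= \left(1 - \frac{1}{\binom{n}{2}}\right) r^{2} + O(r^{3}),
\end{aligned}$$

which is the desired bound.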
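The averaging step is a purely linear-algebraic identity in the tangent space and can be verified numerically. The following Python sketch draws a random antisymmetric matrix `h` (a stand-in for $\Pi_T(yx^{\dagger} - \mathrm{id})$), removes its component along $a_{ij}$ for each pair, and averages over pairs. The helper names and the sign convention for $a_{k\ell}$ are our assumptions; the identity does not depend on the sign.

```python
import itertools
import math
import random

def a_basis(n, k, l):
    """Antisymmetric basis matrix a_{kl} (0-based, k < l); the sign is an
    assumed convention and does not affect the squared inner products."""
    a = [[0.0] * n for _ in range(n)]
    a[l][k] = 1.0 / math.sqrt(2.0)
    a[k][l] = -1.0 / math.sqrt(2.0)
    return a

def hs_inner(a, b):
    """Hilbert-Schmidt inner product of two real square matrices."""
    n = len(a)
    return sum(a[r][c] * b[r][c] for r in range(n) for c in range(n))

random.seed(0)
n = 6
pairs = list(itertools.combinations(range(n), 2))

# Random element of the tangent space T (antisymmetric matrix).
h = [[0.0] * n for _ in range(n)]
for k, l in pairs:
    h[l][k] = random.gauss(0.0, 1.0)
    h[k][l] = -h[l][k]

norm_sq = hs_inner(h, h)
# ||h - <h, a_ij> a_ij||^2 = ||h||^2 - <h, a_ij>^2, averaged over all pairs.
avg = sum(norm_sq - hs_inner(h, a_basis(n, i, j)) ** 2 for i, j in pairs) / len(pairs)
expected = (1.0 - 1.0 / math.comb(n, 2)) * norm_sq
print(avg, expected)
```

The two printed values should coincide up to rounding, which is exactly the factor $1 - 1/\binom{n}{2}$ appearing in the bound.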

To finish the proof, we apply our result on local-to-global couplings, Theorem 3.
We have shown that the Markov transition kernel $P = K$ for Kac's
random walk is locally $C$-Lipschitz for

$$C = \sqrt{1 - \frac{1}{\binom{n}{2}}}, \qquad 1 \le p \le 2.$$

The remaining assumptions of Theorem 3 are trivially verified since $SO(n)$
has a bounded diameter. We conclude that

$$\forall \mu,\nu \in \mathrm{Pr}(SO(n)),\quad W_{D,p}(\mu K,\nu K) \le \sqrt{1 - \frac{1}{\binom{n}{2}}}\; W_{D,p}(\mu,\nu). \qquad \Box$$

4.4. Mixing time upper bound. We now prove Theorem 1. 

Proof of Theorem 1. We shall apply Corollary 1 with $M = SO(n)$,
$d = D$ and $P = K$. According to Lemma 1, we can take
$C = \sqrt{1 - 1/\binom{n}{2}} \le (1 - \kappa)$ for $\kappa = 1/n^{2}$.

We need an estimate for the diameter of $SO(n)$ under $D$. Let $a, b \in SO(n)$.
Then $D(a,b) = D(c,\mathrm{id})$ with $c = ab^{\dagger} \in SO(n)$. It is well known that any such
$c$ is a product of two-dimensional rotations on orthogonal subspaces; that is
equivalent to saying that (after a change of basis of $\mathbb{R}^{n}$) one can write

$$c = \prod_{i=1}^{k} R(2i-1,2i,\theta_i)$$

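for $k = \lfloor n/2\rfloor$ and $-\pi \le \theta_i \le \pi$ without loss of generality. Notice that one
can rewrite this as [cf. (14)]

$$c = \sum_{i=1}^{k}\left[\cos\theta_i\,(d_{2i-1} + d_{2i}) + \sqrt{2}\,\sin\theta_i\; a_{2i-1,2i}\right] + (n-2k)\,d_n,$$

where the last term is present only when $n$ is odd. Thus, the curve

$$\gamma(t) = \sum_{i=1}^{k}\left[\cos(t\theta_i)\,(d_{2i-1} + d_{2i}) + \sqrt{2}\,\sin(t\theta_i)\; a_{2i-1,2i}\right] + (n-2k)\,d_n, \qquad 0 \le t \le 1,$$

connects $\mathrm{id}$ to $c$ in $SO(n)$. Moreover, for all $0 \le t \le 1$,

$$\gamma'(t) = \sum_{i=1}^{k}\theta_i\left[\cos(t\theta_i + \pi/2)\,(d_{2i-1} + d_{2i}) + \sqrt{2}\,\sin(t\theta_i + \pi/2)\; a_{2i-1,2i}\right]$$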

and one can easily see that

$$\|\gamma'(t)\|_{\mathrm{hs}}^{2} = 2\sum_{i=1}^{k}|\theta_i|^{2} \le 2k\pi^{2} \le \pi^{2}n \qquad (\text{since } k \le n/2).$$

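We deduce that

$$\forall a,b \in SO(n),\quad D(a,b) = D(ab^{\dagger},\mathrm{id}) \le \int_0^1 \|\gamma'(t)\|_{\mathrm{hs}}\,dt \le \pi\sqrt{n}.$$

Thus, $\mathrm{diam}_D(SO(n)) \le \pi\sqrt{n}$ and we deduce from the corollary that

$$\tau_{D,2}(\varepsilon) \le \left\lceil n^{2}\ln\left(\frac{\pi\sqrt{n}}{\varepsilon}\right)\right\rceil. \qquad \Box$$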
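To get a feeling for the size of this bound, one can evaluate it for concrete $n$ and $\varepsilon$. The Python sketch below is ours and only illustrative: it assumes the bound has the form $\lceil \kappa^{-1}\ln(\mathrm{diam}/\varepsilon)\rceil$ with $\kappa = 1/n^{2}$ and diameter $\pi\sqrt{n}$, as in the proof above, and it also checks the elementary inequality $\sqrt{1 - 1/\binom{n}{2}} \le 1 - 1/n^{2}$ used to pick $\kappa$.

```python
import math

def contraction_constant(n):
    """Local Lipschitz constant sqrt(1 - 1/C(n, 2)) from Lemma 1."""
    return math.sqrt(1.0 - 1.0 / math.comb(n, 2))

def mixing_time_upper_bound(n, eps):
    """ceil(n^2 * ln(pi * sqrt(n) / eps)); assumed form of the bound."""
    return math.ceil(n * n * math.log(math.pi * math.sqrt(n) / eps))

# The constant from Lemma 1 is at most 1 - kappa with kappa = 1/n^2.
all_ok = all(contraction_constant(n) <= 1.0 - 1.0 / n ** 2 for n in range(2, 500))

bound_10 = mixing_time_upper_bound(10, 0.1)
bound_100 = mixing_time_upper_bound(100, 0.1)
print(all_ok, bound_10, bound_100)
```

The bound grows like $n^{2}\ln n$ for fixed $\varepsilon$ and only logarithmically in $1/\varepsilon$.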



5. Mixing bounds for other random walks. In this section we briefly 
discuss the two random walks related to Kac's random walk mentioned in 
the introduction. Both proofs follow the previous one very closely and will 
be only sketched. 

5.1. Kac's walk with nonuniform angles. Recall the definitions in Section 4.1.
In this section we let $\rho : [0,2\pi] \to \mathbb{R}_{+}$ be a density and define a
variant $K^{(\rho)}$ of Kac's random walk on $SO(n)$ as follows:

$$K^{(\rho)}_x(S) = \frac{1}{\binom{n}{2}} \sum_{1\le i<j\le n} \int_0^{2\pi} \delta_{R(i,j,\theta)x}(S)\,\rho(\theta)\,d\theta$$

[$x \in SO(n)$, $S \subset SO(n)$ measurable].
$K^{(\rho)}$ corresponds to picking the rotation angle with density $\rho$. One can check
that $K^{(\rho)}$ is a valid Markov transition kernel for any density $\rho$ and that the
original process corresponds to $\rho \equiv 1/2\pi$. We will prove the following:

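Theorem 4. Suppose

$$\rho_{\min} = \min_{\theta\in[0,2\pi]} \rho(\theta) > 0.$$

Then the $L^{2}$ transportation cost mixing time of $K^{(\rho)}$ satisfies

$$\tau_{D,2}(\varepsilon) \le \left\lceil \frac{n^{2}}{2\pi\rho_{\min}} \ln\left(\frac{\pi\sqrt{n}}{\varepsilon}\right)\right\rceil, \qquad \varepsilon > 0.$$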

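Proof sketch. The main step is to show that $K^{(\rho)}$ is

$$\sqrt{1 - \frac{2\pi\rho_{\min}}{\binom{n}{2}}}\text{-locally contracting.}$$

We do this as in Lemma 1, showing that for any $x, y \in SO(n)$ with $D(x,y) = r$,
there exists a coupling $(X,Y)$ of $(K^{(\rho)}_x, K^{(\rho)}_y)$ with

$$\mathbb{E}[D(X,Y)^{2}] \le \left(1 - \frac{2\pi\rho_{\min}}{\binom{n}{2}}\right) r^{2} + O(r^{3}).$$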

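To do this, we first note that $0 < 2\pi\rho_{\min} \le 1$ (if $2\pi\rho_{\min} = 1$, then $\rho$ is
uniform and there is nothing new to prove) and write $\rho$ as a mixture:

$$\rho = 2\pi\rho_{\min}\, g + (1 - 2\pi\rho_{\min})\, h,$$

where $g \equiv 1/2\pi$ is the uniform density and

$$h(\theta) = \frac{\rho(\theta) - \rho_{\min}}{1 - 2\pi\rho_{\min}}$$

is another density. We will set $X = R(i,j,\theta)x$, $Y = R(i,j,\theta')y$ as in the proof
of Lemma 1, choosing $1 \le i < j \le n$ uniformly at random. The choices of
$\theta, \theta'$ will be made as follows: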

1. with probability $2\pi\rho_{\min}$, we pick $\theta$ from the uniform density $g$ and set
$\theta' = (\theta - \alpha) \bmod 2\pi$ as in the previous proof;

2. with probability $1 - 2\pi\rho_{\min}$, we pick $\theta$ with density $h$ and set $\theta' = \theta$.
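The mixture decomposition behind these two cases can be illustrated on a concrete density. In the Python sketch below, the density `rho` is a hypothetical example of ours (it does not come from the paper), and the code simply checks that $\rho = 2\pi\rho_{\min}\,g + (1 - 2\pi\rho_{\min})\,h$ pointwise, that $h$ is a nonnegative density, and evaluates the resulting contraction constant.

```python
import math

def rho(theta):
    """Hypothetical example density on [0, 2*pi] (not from the paper)."""
    return (1.0 + 0.5 * math.cos(theta)) / (2.0 * math.pi)

RHO_MIN = 0.5 / (2.0 * math.pi)       # minimum of rho, attained at theta = pi
WEIGHT = 2.0 * math.pi * RHO_MIN      # probability of the "uniform" case 1

def g(theta):
    """Uniform density on [0, 2*pi]."""
    return 1.0 / (2.0 * math.pi)

def h(theta):
    """Residual density used in case 2."""
    return (rho(theta) - RHO_MIN) / (1.0 - WEIGHT)

def contraction_factor(n):
    """sqrt(1 - 2*pi*rho_min / C(n, 2)) for the example density."""
    return math.sqrt(1.0 - WEIGHT / math.comb(n, 2))

grid = [2.0 * math.pi * k / 1000 for k in range(1000)]
max_gap = max(abs(rho(t) - (WEIGHT * g(t) + (1.0 - WEIGHT) * h(t))) for t in grid)
h_mass = sum(h(t) for t in grid) * (2.0 * math.pi / 1000)   # Riemann sum of h
h_min = min(h(t) for t in grid)
factor_10 = contraction_factor(10)
print(WEIGHT, max_gap, h_mass, h_min, factor_10)
```

Only case 1 contributes to the contraction, which is why the constant degrades from $1/\binom{n}{2}$ to $2\pi\rho_{\min}/\binom{n}{2}$.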

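Using the notation and reasoning in the previous proof (so that $h$ below denotes the
tangent vector $\Pi_T(yx^{\dagger} - \mathrm{id})$, not the density), we immediately see
that in the first case $D(X,Y)^{2} = \|h\|_{\mathrm{hs}}^{2} - \langle h,a_{ij}\rangle_{\mathrm{hs}}^{2} + O(r^{3})$, whereas in the
second case $D(X,Y)^{2} = r^{2}$. It follows that

$$\begin{aligned}
\mathbb{E}[D(X,Y)^{2}] &= 2\pi\rho_{\min}\left\{\|h\|_{\mathrm{hs}}^{2} - \frac{1}{\binom{n}{2}}\sum_{1\le i<j\le n}\langle h,a_{ij}\rangle_{\mathrm{hs}}^{2}\right\} + (1 - 2\pi\rho_{\min})\,r^{2} + O(r^{3})\\
&= \left(1 - \frac{2\pi\rho_{\min}}{\binom{n}{2}}\right) r^{2} + O(r^{3}),
\end{aligned}$$

which is the desired bound.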
5.2. A random walk on unitary matrices. In this section we consider a
random walk on unitary matrices. To define it properly, we need a set of
definitions analogous to that in Section 4.1, which we briefly state below.

$M(n,\mathbb{C})$ is the set of all complex $n \times n$ matrices. In the present setting
$a^{*}$ is the conjugate transpose of $a \in M(n,\mathbb{C})$ and we can define the
Hilbert-Schmidt inner product (and corresponding norm) via

$$\langle a,b\rangle_{\mathrm{hs}} = \mathrm{Tr}(ab^{*}), \qquad a, b \in M(n,\mathbb{C}).$$

With this inner product, $M(n,\mathbb{C})$ is isomorphic to $\mathbb{C}^{n^{2}}$ with the Euclidean
inner product. Call $a \in M(n,\mathbb{C})$ unitary if $aa^{*} = a^{*}a = \mathrm{id}$, the identity matrix.
The set $U(n) \subset M(n,\mathbb{C})$ of all $n \times n$ unitary matrices is a smooth, compact
submanifold of $M(n,\mathbb{C})$, which is also a Lie group. The metric $D$ in this
case is the Riemannian metric induced on $U(n)$ by the Hilbert-Schmidt
inner product on the ambient space $M(n,\mathbb{C})$, which is again invariant by
multiplication. Moreover, there exists a multiplication-invariant probability
measure on $U(n)$, the Haar measure, which we again denote by $\mathcal{H}$.

Let $e_1,\dots,e_n$ be the canonical basis for $\mathbb{C}^{n}$. For each $1 \le i < j \le n$ fix a
(linear) isometry $I_{ij} : \mathrm{span}\{e_i,e_j\} \to \mathbb{C}^{2}$. If $u \in U(2)$, we let $u_{ij} \in U(n)$ be
the unitary operator that acts as $I_{ij}^{-1}\circ u\circ I_{ij}$ on $\mathrm{span}\{e_i,e_j\}$ and as the
identity on $\mathrm{span}\{e_i,e_j\}^{\perp}$ (that is, $u_{ij}$ acts "like" $u$ on $e_i,e_j$). Our random
walk is defined by the kernel $L$ given by

$$L_x(S) = \frac{1}{\binom{n}{2}}\sum_{1\le i<j\le n}\int_{U(2)} \delta_{u_{ij}x}(S)\,dH(u) \qquad [x \in U(n),\ S \subset U(n)\ \text{measurable}],$$

where $H$ is the Haar measure on $U(2)$. Thus, $X \stackrel{d}{=} L_x$ is obtained from $x$
by first choosing $i,j$ uniformly at random, then picking $R \in U(2)$ from the
$(2 \times 2)$ Haar measure independently from $i,j$ and then letting $R_{ij}$ act over
the two-dimensional subspace $\mathrm{span}\{e_i,e_j\}$.

Our main goal will be to prove an analogue of Theorem 1 in this setting.

Theorem 5. Let $\tau_{D,2}(\varepsilon)$ denote the $L^{2}$ transportation-cost mixing time
for $(M,d) = (U(n),D)$ and $P = L$ as just defined. Then

$$\tau_{D,2}(\varepsilon) \le \left\lceil n^{2}\ln\left(\frac{\pi\sqrt{n}}{\varepsilon}\right)\right\rceil.$$

Proof sketch. According to Corollary 1, we need two ingredients: a
$\pi\sqrt{n}$ bound on the diameter of $U(n)$ and a "local contraction" estimate
for $(L_x,L_y)$ akin to Lemma 1. The diameter bound is easily obtained. Any
$u \in U(n)$ has orthogonal eigenvectors with eigenvalues of the form $e^{\sqrt{-1}\,\theta_i}$
for $\theta_i \in [-\pi,\pi]$, $1 \le i \le n$. For all $t \in [0,1]$, let $u^{t}$ be the matrix with the
same eigenbasis and eigenvalues $e^{\sqrt{-1}\,t\theta_i}$; hence, $u^{t} \in U(n)$ always. The curve
$t \mapsto u^{t}$ ($t \in [0,1]$) has constant speed equal to

$$\sqrt{\sum_{i=1}^{n}\theta_i^{2}} \le \pi\sqrt{n}$$

and connects $\mathrm{id}$ to $u$; any $x, y$ can be connected by the curve $t \mapsto (yx^{*})^{t}x$,
which also has length $\le \pi\sqrt{n}$, hence, $D(x,y) \le \pi\sqrt{n}$ for all $x, y \in U(n)$, as
desired.

We now provide a local contraction estimate. The key realization is that
the tangent space of $U(n)$ at the identity is

$$T = T_{\mathrm{id}}(U(n)) = \{h \in M(n,\mathbb{C}) : h = -h^{*}\}.$$

This means that if $x, y \in U(n)$ and $D(x,y) = r$,

$$\|yx^{*} - \mathrm{id} - h\|_{\mathrm{hs}} = O(r^{2}) \quad\text{for } h = \Pi_T(yx^{*} - \mathrm{id}),$$

$\Pi_T$ being the orthogonal projector of $M(n,\mathbb{C})$ onto $T$ (as in the previous
proof). Moreover, the estimates in Section 4.2 carry over to our current
setting.

Suppose $x, y$ as above are given. We choose $1 \le i < j \le n$ uniformly at
random, $R \in U(2)$ from the Haar measure and will set $R' = Rv$ for some
$v = v(i,j,x,y)$ in $U(2)$ to be chosen, so that $R'$ is also Haar distributed on
$U(2)$, independently of $i, j, x, y$. This implies that

$$(X,Y) = (R_{ij}x, R'_{ij}y)$$

is a valid coupling of $(L_x,L_y)$. Moreover,

$$D(X,Y) = D(v_{ij}, yx^{*}).$$

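We will now define an orthonormal basis for $M(n,\mathbb{C})$. For $k,\ell \in \{1,\dots,n\}$,
let $u_{k\to\ell}$ be the unique linear operator that maps $e_k$ to $e_{\ell}$ and $e_t$ to $0$ for
all $t \ne k$. The matrices $\{u_{k\to\ell}\}_{1\le k,\ell\le n}$ form an orthonormal basis of $M(n,\mathbb{C})$.
Since $h^{*} = -h$, one can check that

$$h = \sum_{k=1}^{n} \sqrt{-1}\,h(k,k)\,u_{k\to k} + \sum_{1\le k<\ell\le n}\left(h(k,\ell)\,u_{k\to\ell} - \overline{h(k,\ell)}\,u_{\ell\to k}\right),$$

with $h(k,k) \in \mathbb{R}$ and $h(k,\ell) \in \mathbb{C}$. By orthogonality, we have

$$\|h\|_{\mathrm{hs}}^{2} = \sum_{k=1}^{n} h(k,k)^{2} + 2\sum_{1\le k<\ell\le n}|h(k,\ell)|^{2}.$$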

We will make a choice of $v$ such that

$$v_{ij} = I_{ij}^{-1}\circ v\circ I_{ij} = e^{h_{ij}} \quad\text{with}\quad h_{ij} = \sqrt{-1}\,\bigl(h(i,i)\,u_{i\to i} + h(j,j)\,u_{j\to j}\bigr) + h(i,j)\,u_{i\to j} - \overline{h(i,j)}\,u_{j\to i}.$$

Indeed, since $h_{ij}^{*} = -h_{ij}$, $v_{ij} \in U(n)$. Moreover, since $e^{h_{ij}}e_t = e_t$ for $t \ne i,j$,
this $e^{h_{ij}}$ acts nontrivially only on $\mathrm{span}\{e_i,e_j\}$ and one can easily see that
this implies the existence of the desired $v$. Finally, this $v$ only depends on
$i, j$ and $x, y$ [through $h = \Pi_T(yx^{*} - \mathrm{id})$], therefore, it is a valid choice for the
coupling construction of $R$ and $R' = Rv$.

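One can check that $\|v_{ij} - \mathrm{id}\|_{\mathrm{hs}} = O(r)$, that $\|v_{ij} - \mathrm{id} - h_{ij}\|_{\mathrm{hs}} = O(r^{2})$ and,
therefore,

$$\begin{aligned}
D(X,Y)^{2} &= D(v_{ij},yx^{*})^{2}\\
&= \|v_{ij} - yx^{*}\|_{\mathrm{hs}}^{2} + O(r^{3})\\
&= \|h_{ij} - h\|_{\mathrm{hs}}^{2} + O(r^{3})\\
&= \|h\|_{\mathrm{hs}}^{2} - h(i,i)^{2} - h(j,j)^{2} - 2|h(i,j)|^{2} + O(r^{3}) && \text{(expand $h_{ij} - h$)}.
\end{aligned}$$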
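Averaging over the choices of $u$, $i$ and $j$, we get

$$\begin{aligned}
\mathbb{E}[D(X,Y)^{2}] &= \|h\|_{\mathrm{hs}}^{2} - \frac{2}{n}\sum_{i=1}^{n}h(i,i)^{2} - \frac{1}{\binom{n}{2}}\sum_{1\le i<j\le n}2|h(i,j)|^{2} + O(r^{3}) && (18)\\
&\le \left(1 - \frac{1}{\binom{n}{2}}\right) r^{2} + O(r^{3}). && (19)
\end{aligned}$$

This implies that the chain $L$ is

$$\sqrt{1 - \frac{1}{\binom{n}{2}}}\text{-locally contracting,}$$

which implies the desired result via Theorem 3. $\Box$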
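The passage from the per-pair identity to (18) and (19) is a counting argument: each diagonal coefficient $h(i,i)$ appears in $n-1$ of the $\binom{n}{2}$ pairs, which gives the factor $2/n$. The Python sketch below checks this on random coefficients. The arrays `diag` and `off` are our own encoding of the real numbers $h(k,k)$ and complex numbers $h(k,\ell)$, and the $O(r^{3})$ terms are ignored.

```python
import itertools
import math
import random

random.seed(1)
n = 5
pairs = list(itertools.combinations(range(n), 2))

# Coefficients of an anti-Hermitian h: real h(k,k) and complex h(k,l), k < l.
diag = [random.gauss(0.0, 1.0) for _ in range(n)]
off = {p: complex(random.gauss(0.0, 1.0), random.gauss(0.0, 1.0)) for p in pairs}

norm_sq = sum(d * d for d in diag) + 2.0 * sum(abs(z) ** 2 for z in off.values())

def removed(i, j):
    """Part of ||h||^2 cancelled when the pair (i, j) is chosen."""
    return diag[i] ** 2 + diag[j] ** 2 + 2.0 * abs(off[(i, j)]) ** 2

# Direct average over pairs versus the closed form (18).
avg_direct = sum(norm_sq - removed(i, j) for i, j in pairs) / len(pairs)
formula_18 = (norm_sq
              - (2.0 / n) * sum(d * d for d in diag)
              - sum(2.0 * abs(z) ** 2 for z in off.values()) / math.comb(n, 2))
bound_19 = (1.0 - 1.0 / math.comb(n, 2)) * norm_sq
print(avg_direct, formula_18, bound_19)
```

Since $2/n \ge 1/\binom{n}{2}$ for $n \ge 2$, the diagonal part contracts at least as fast as the off-diagonal part, which is what (19) records.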

6. Lower bounds for mixing times. In this section we prove a general 
mixing time lower bound for random walks induced by group actions. Again, 
let (M, d) be a metric space. 

Assumption 1. $M$ is compact (hence Polish). There exists a group
$G$ acting isometrically on $M$ on the left. That means that there exists a
mapping taking $(g,x) \in G\times M$ to $gx \in M$ such that for all $g,h \in G$,
$g(hx) = (gh)x$ and for all $g \in G$, $x,y \in M$, $d(gx,gy) = d(x,y)$. We also assume that
there is a metric $\tilde d$ on $G$ such that $(G,\tilde d)$ is compact and

$$\forall g,h \in G,\quad \tilde d(g,h) \ge \sup_{x\in M} d(gx,hx).$$

Finally, a Markov transition kernel $P$ on $M$ is defined via a probability
measure $\alpha$ on $G$ as follows:

$$\forall x \in M,\quad P_x = \int_G \delta_{hx}\,d\alpha(h).$$

That is, to sample $X \stackrel{d}{=} P_x$, one samples $h \stackrel{d}{=} \alpha$ and sets $X = hx$.

One can check that $P$ is indeed a Markov transition kernel; this follows
from the fact that $x \mapsto P_x$ is 1-Lipschitz as a map from $(M,d)$ to
$(\mathrm{Pr}(M),W_{d,1})$.

It is well known that compactness of $(M,d)$ and $(G,\tilde d)$ implies the following
(we will use $\sim$ to denote all quantities related to the metric $\tilde d$):

• For all $r > 0$ and $H \subset G$, $H$ can be covered by finitely many open balls
of radius $r$ in $G$; the minimal number of balls in such a covering is called
the $r$-covering number of $H$ and denoted by $\tilde C_H(r)$.

• For all $r > 0$, there exists a number $N_M(r)$, called the $r$-packing number
of $M$, which is the largest cardinality of a subset $S \subset M$ with $d(s,s') > r$
for all distinct $s,s' \in S$ (we call such an $S$ maximally $r$-sparse).

We can now state our general lower bound result. 

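Theorem 6. Under Assumption 1, suppose that there exists a measure
$\mu_{*} \in \mathrm{Pr}(M)$ and numbers $\tau \in \mathbb{N}$, $\varepsilon > 0$ and $p \ge 1$ such that

$$\forall x \in M,\quad W_{d,p}(P^{\tau}_x,\mu_{*}) \le \varepsilon.$$

Then

$$\tau \ge \frac{\ln N_M(8\varepsilon) - \ln 2}{\ln \tilde C_H(\varepsilon/\tau)},$$

where $H$ is the support of $\alpha$.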

To understand Theorem 6, it is a good idea to consider the special case where
$M = G$ is a finite-dimensional Lie group (acting on itself by left-multiplication),
$\mu_{*}$ is a Haar measure on $G$, $P^{t}_x \to \mu_{*}$ for all $x \in G$ as $t \to +\infty$ and
$\tau = \tau_{d,p}(\varepsilon)$ is the $\varepsilon$-mixing time. Since $G$ is a Lie group, thus a smooth
manifold that is locally Euclidean, one would expect that

$$\ln N_G(r) \sim (\text{dimension of } G)\,\ln(1/r), \qquad 0 < r \ll 1.$$

Similarly, if $H$ has a dimension (in some loosely defined sense), we expect
that

$$\ln \tilde C_H(r) \sim (\text{dimension of } H)\,\ln(1/r), \qquad 0 < r \ll 1.$$

Thus, for small enough $\varepsilon$, one would have

$$\tau_{d,p}(\varepsilon) \gtrsim \frac{(\text{dimension of } G)}{(\text{dimension of } H)},$$

at least up to constant factors. The upshot is that a "small" (low-dimensional)
set of generators $H$ cannot generate a "large" (high-dimensional) group $G$
in time less than the ratio of the dimensions.
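As a rough illustration of this heuristic (ours, and no more rigorous than the discussion above), take $G = SO(n)$ and let $H$ be the set of plane rotations $R(i,j,\theta)$ used by Kac's walk. This $H$ is a finite union of circles, so it is natural to treat it as one-dimensional:

```latex
\[
\dim SO(n) = \binom{n}{2} = \frac{n(n-1)}{2},
\qquad
\dim H = 1
\quad\Longrightarrow\quad
\tau_{d,p}(\varepsilon) \;\gtrsim\; \frac{n(n-1)/2}{1} \;\asymp\; n^{2}.
\]
```

The dimension count therefore suggests a lower bound of order $n^{2}$ for Kac's walk, again only up to constant factors.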
Of course, the reasoning we just presented is not a rigorous proof. In the
particular case of Kac's walk, we will need to have bounds on $\tilde C_H(\varepsilon)$ and
$N_G(\varepsilon)$ that work for a fixed $\varepsilon$, not for $\varepsilon \to 0$.

Let us now prove the general theorem (the bound for Kac's walk is proven
subsequently).

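Proof of Theorem 6. Let $c = \tilde C_H(\varepsilon/\tau)$. By assumption, $H$ can be
covered by $c$ open balls of radius $\varepsilon/\tau$ according to $\tilde d$, which we represent
with $\tilde B$:

$$H \subset \tilde B(h_1,\varepsilon/\tau)\cup\dots\cup\tilde B(h_c,\varepsilon/\tau).$$

Define the sets $S_1 = \tilde B(h_1,\varepsilon/\tau)\cap H$ and
$S_i = \tilde B(h_i,\varepsilon/\tau)\cap H\setminus\bigcup_{j=1}^{i-1}\tilde B(h_j,\varepsilon/\tau)$ for $2 \le i \le c$.
These sets form a partition of $H$, hence, the following sum defines a
probability measure supported on $\{h_1,\dots,h_c\}$:

$$\beta = \sum_{i=1}^{c}\alpha(S_i)\,\delta_{h_i}.$$

In fact, $\beta$ is the image of $\alpha$ under the map $\Psi$ that maps the elements of $S_i$ to
$h_i$, for each $i \in \{1,\dots,c\}$. This map satisfies $\tilde d(\Psi(h),h) < \varepsilon/\tau$ because
$S_i \subset \tilde B(h_i,\varepsilon/\tau)$ by construction. One may check that this implies
$W_{\tilde d,p}(\alpha,\beta) \le \varepsilon/\tau$.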

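Let $Q$ be the Markov transition kernel corresponding to $\beta$ in the same
way that $P$ corresponds to $\alpha$; that is,

$$\forall x\in M,\quad Q_x = \int_{\{h_1,\dots,h_c\}}\delta_{hx}\,d\beta(h) = \sum_{i=1}^{c}\alpha(S_i)\,\delta_{h_ix}.$$

For any $x \in M$, if the random pair $(A,B)$ is a coupling of $(\alpha,\beta)$ with
$\mathbb{E}[\tilde d(A,B)^{p}]^{1/p} \le \varepsilon/\tau$, then $(Ax,Bx)$ is a coupling of $(P_x,Q_x)$ with

$$W_{d,p}(P_x,Q_x) \le \mathbb{E}[d(Ax,Bx)^{p}]^{1/p} \le \mathbb{E}[\tilde d(A,B)^{p}]^{1/p} \le \varepsilon/\tau.$$

Hence,

$$\forall x\in M,\quad W_{d,p}(P_x,Q_x) \le \varepsilon/\tau.$$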
A simple calculation implies

$$\forall x\in M,\quad W_{d,p}(Q^{\tau}_x,P^{\tau}_x) \le \varepsilon.$$

For any $x \in M$, the definition of $\tau$ implies $W_{d,p}(\mu_{*},P^{\tau}_x) \le \varepsilon$, so that

$$W_{d,p}(Q^{\tau}_x,\mu_{*}) \le W_{d,p}(P^{\tau}_x,\mu_{*}) + W_{d,p}(P^{\tau}_x,Q^{\tau}_x) \le 2\varepsilon.$$

Thus, the $p$-optimal coupling $(X_x,Y)$ of $(Q^{\tau}_x,\mu_{*})$ achieves

$$\mathbb{E}[d(X_x,Y)^{p}]^{1/p} \le 2\varepsilon.$$

Now let $S \subset M$ be a maximal $8\varepsilon$-sparse subset of $M$. Notice that the cardinality
of $S$ is $\ell = N_M(8\varepsilon)$, by definition of the latter quantity. One may define
random variables $\{X_x\}_{x\in S}$, $Y$ on the same probability space such that, for
each $x \in S$, $(X_x,Y)$ is a coupling of $(Q^{\tau}_x,\mu_{*})$ achieving the above bound (this
follows, e.g., from the Gluing lemma in the Introduction of Villani's book
[15]). Hence, if

$$I_x = \chi_{\{d(X_x,Y) > 4\varepsilon\}} \qquad (x \in S),$$

Markov's inequality implies that

$$\mathbb{P}(I_x = 1) \le 1/2$$

and

$$\mathbb{E}\left[\sum_{x\in S} I_x\right] \le \frac{\ell}{2}.$$

It follows that there exists a realization of $\{X_x\}_{x\in S}, Y$ such that $d(X_x,Y) \le 4\varepsilon$
for all $x$ in a subset $S' \subset S$ of cardinality $\ge \ell/2$.

We fix such a realization. For each $x \in S$, the support of the measure $Q_x$
is contained in the finite set $\{h_1x,\dots,h_cx\}$. A simple inductive argument
shows that $X_x = v_xx$ for some

$$v_x \in V_{\tau} = \{h_{i_1}h_{i_2}\cdots h_{i_{\tau}} : i_1,i_2,\dots,i_{\tau} \in \{1,2,\dots,c\}\}.$$

Now notice that, on the one hand, for all $x, x' \in S'$,

$$d(X_x,X_{x'}) \le d(X_x,Y) + d(X_{x'},Y) \le 4\varepsilon + 4\varepsilon = 8\varepsilon.$$

On the other hand, for distinct $x, x' \in S$, if $v_x = v_{x'}$, then
$d(X_x,X_{x'}) = d(v_xx,v_xx') = d(x,x') > 8\varepsilon$ since $S$ is $8\varepsilon$-sparse and $d$ is invariant by left
multiplication. We deduce that

$$\forall x,x' \in S',\quad x \ne x' \;\Longrightarrow\; v_x \ne v_{x'}.$$

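This implies that

$$\ell/2 \le \text{cardinality of } S' \le \text{cardinality of } V_{\tau}$$

and the latter quantity is clearly upper bounded by $c^{\tau}$. We deduce that

$$\frac{\ell}{2} \le c^{\tau} \;\Longrightarrow\; \tau \ge \frac{\ln\ell - \ln 2}{\ln c}.$$

The proof is finished once we recall that $\ell = N_M(8\varepsilon)$ and $c = \tilde C_H(\varepsilon/\tau)$. $\Box$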
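The final counting step can be made concrete with toy numbers. In the Python sketch below the packing and covering numbers are hypothetical values chosen by us for illustration, and the function name is ours. Note that in the theorem the covering number itself depends on $\tau$ through the radius $\varepsilon/\tau$, whereas here it is held fixed.

```python
import math

def theorem6_bound(packing_number, covering_number):
    """(ln N - ln 2) / ln C, the right-hand side of Theorem 6."""
    return (math.log(packing_number) - math.log(2.0)) / math.log(covering_number)

# Hypothetical numbers: 2**20 well-separated points, 4 discretized generators.
ell, c = 2 ** 20, 4
bound = theorem6_bound(ell, c)

# Smallest tau for which words of length tau can reach ell / 2 distinct points.
smallest_tau = min(t for t in range(1, 100) if c ** t >= ell / 2)
print(bound, smallest_tau)
```

With these numbers the bound is $9.5$, and indeed ten letters from a four-letter alphabet are needed before $c^{\tau} \ge \ell/2$.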
6.1. The lower bound for Kac's random walk (Theorem 2). We now show
that Theorem 2 follows from the general lower bound.

Proof of Theorem 2. We will freely use the notation introduced in
Section 4. In particular, we take $M = G = SO(n)$, $d = \mathrm{hs}$, $P = K$, $\mu_{*} = \mathcal{H}$,
$\tau = \tau_{d,p}(\varepsilon)$ and

(20) $$H = \bigcup_{1\le i<j\le n}\{R(i,j,\theta) : \theta \in [0,2\pi]\}$$

and

$$\alpha = \frac{1}{\binom{n}{2}}\sum_{1\le i<j\le n}\frac{1}{2\pi}\int_0^{2\pi}\delta_{R(i,j,\theta)}\,d\theta.$$

Notice that $d$ is right-invariant and that we may take $\tilde d = d$ in this case.

We now upper bound $\tilde C_H(r)$. Equation (20) shows that $H$ is the union of
$\binom{n}{2}$ sets. Each of those is an isometric image of the unit circle in the intrinsic
metric $D$ of $SO(n)$. Since hs is dominated by $D$, we have

(21) $$\forall\, 0 < r \le 2\pi,\quad \tilde C_H(r) \le 2\pi\binom{n}{2}r^{-1} \le \pi n^{2}r^{-1}.$$

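We must also lower bound $N_M(r)$. A maximal $r$-packing $S \subset SO(n)$ has
to satisfy $\min_{s\in S} d(x,s) \le r$ for all $x \in SO(n)$ (an $x$ violating the bound
could be added to $S$, which violates maximality). This implies that

(22) $$SO(n) = \bigcup_{s\in S}B(s,r) \;\Longrightarrow\; \sum_{s\in S}\mathcal{H}(B(s,r)) \ge 1 \;\overset{\text{(Inv)}}{\Longrightarrow}\; N_M(r) \ge \frac{1}{\mathcal{H}(B(\mathrm{id},r))}.$$

[Implication (Inv) uses the invariance of $\mathcal{H}$, which implies that all balls of
radius $r$ have the same measure.]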
We now make the following claim: 

Claim 1. There exist constants $\phi, \psi > 0$ such that, for all $n \ge 10$ and
$0 < r \le \sqrt{n}/10$,

$$\mathcal{H}(B(\mathrm{id},r)) \le \left(\frac{e^{\phi}\,r}{\sqrt{n}}\right)^{\psi n^{2}}.$$

The restrictions on $n, r$ are by no means sharp, but they give us some
room to spare in what follows.
Before proving the claim, we show how the theorem follows from it. Given
$n \ge 10$, assume $\tau = \tau_{d,p}(\varepsilon) \le n^{3}/\pi$. We see that, for $0 < \varepsilon < 1$,

$$\begin{aligned}
\tau\ln\tilde C_H(\varepsilon/\tau) &\le (\ln\tau + 2\ln n + \ln\pi + \ln(1/\varepsilon))\,\tau\\
&\le (5\ln n + \ln(1/\varepsilon))\,\tau && (\text{use } \tau \le n^{3}/\pi);\\
\ln N_M(8\varepsilon) &\ge -\ln\mathcal{H}(B(\mathrm{id},8\varepsilon)) && [\text{via (22)}]\\
&\ge \psi n^{2}\left(\frac{\ln n}{2} + \ln(1/8\varepsilon) - \phi\right).
\end{aligned}$$

This implies that, for $\varepsilon = e^{-\phi}/8 < 1$, one can use the bound in Theorem 6
to see that

$$\tau \ge \frac{\frac{\psi}{2}\,n^{2}\ln n - \ln 2}{5\ln n + \phi + \ln 8} \ge cn^{2}$$

for some $c > 0$ not depending on $n$. Of course, if $\tau > n^{3}/\pi$, then $\tau \ge cn^{2}$ for a
(possibly smaller) $c > 0$, so the inequality presented above actually implies
the theorem for $n \ge 10$. Since there is only a finite set of remaining values
of $n$, one may finish the proof by picking a smaller $c$, if necessary.

It remains to prove the claim. We will do so via probabilistic reasoning,
using some rough upper estimates and known results for spheres in an arbitrary
dimension. As a preliminary, consider $x \in SO(n)$ and let $x_i \in \mathbb{R}^{n}$
denote its $i$th column. One has

$$\|x - \mathrm{id}\|_{\mathrm{hs}}^{2} = \sum_{i=1}^{n}|x_i - e_i|^{2}.$$

The columns of $x$ are orthonormal, hence, $|x_i| = |e_i| = 1$ and

$$\|x - \mathrm{id}\|_{\mathrm{hs}}^{2} = 2\sum_{i=1}^{n}(1 - x_i\cdot e_i).$$

Hence,

$$\|x - \mathrm{id}\|_{\mathrm{hs}} \le r \iff \frac{1}{n}\sum_{i=1}^{n}(1 - x_i\cdot e_i) \le \frac{r^{2}}{2n}.$$

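One can now use Markov's inequality to deduce that

(23) $$\|x - \mathrm{id}\|_{\mathrm{hs}} \le r \;\Longrightarrow\; \exists I \subset \{1,\dots,n\} \text{ with } |I| = \lceil n/2\rceil \text{ such that } \forall i \in I,\ x_i\cdot e_i \ge 1 - \frac{r^{2}}{n}.$$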
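Both the trace identity and the pigeonhole step (23) are easy to test on a concrete rotation. The Python sketch below is an informal check with names of our choosing: it builds an element of $SO(n)$ as a product of random plane rotations, takes $r = \|x - \mathrm{id}\|_{\mathrm{hs}}$, and counts the indices with $x_i\cdot e_i \ge 1 - r^{2}/n$.

```python
import math
import random

def random_rotation(n, steps, scale, rng):
    """Product of `steps` plane rotations with angles in [-scale, scale]."""
    x = [[1.0 if r == c else 0.0 for c in range(n)] for r in range(n)]
    for _ in range(steps):
        i, j = rng.sample(range(n), 2)
        theta = rng.uniform(-scale, scale)
        cos_t, sin_t = math.cos(theta), math.sin(theta)
        for col in range(n):
            # Left-multiplication by a plane rotation mixes rows i and j.
            xi, xj = x[i][col], x[j][col]
            x[i][col] = cos_t * xi - sin_t * xj
            x[j][col] = sin_t * xi + cos_t * xj
    return x

rng = random.Random(0)
n = 9
x = random_rotation(n, steps=40, scale=0.2, rng=rng)

# ||x - id||_hs^2 computed entrywise, and via 2 * sum_i (1 - x_i . e_i);
# the inner product x_i . e_i is just the (i, i) entry of x.
hs_sq = sum((x[r][c] - (1.0 if r == c else 0.0)) ** 2
            for r in range(n) for c in range(n))
diag_form = 2.0 * sum(1.0 - x[i][i] for i in range(n))

# Indices whose diagonal entry is at least 1 - r^2 / n, with r^2 = hs_sq.
good = [i for i in range(n) if x[i][i] >= 1.0 - hs_sq / n]
print(hs_sq, diag_form, len(good), math.ceil(n / 2))
```

By Markov's inequality fewer than $n/2$ indices can violate the threshold, so `good` always has at least $\lceil n/2\rceil$ elements.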

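Thus, if $X \stackrel{d}{=} \mathcal{H}$ is a random variable, defined on some probability space
$(\Omega,\mathcal{F},\mathbb{P})$ and with values in $SO(n)$,

(24) $$\begin{aligned}
\mathcal{H}(B(\mathrm{id},r)) &= \mathbb{P}(\|X - \mathrm{id}\|_{\mathrm{hs}} \le r)\\
&\le \sum_{I\subset\{1,\dots,n\}:\,|I|=\lceil n/2\rceil}\mathbb{P}\left(\forall i\in I,\ X_i\cdot e_i \ge 1 - \frac{r^{2}}{n}\right)\\
&= \binom{n}{\lceil n/2\rceil}\,\mathbb{P}\left(\forall 1\le i\le\lceil n/2\rceil,\ X_i\cdot e_i \ge 1 - \frac{r^{2}}{n}\right) && [\text{the law of } (X_i\cdot e_i)_{i\in I} \text{ is the same for all } I]\\
&\le 2^{n}\,\mathbb{P}\left(\forall 1\le i\le\lceil n/2\rceil,\ X_i\cdot e_i \ge 1 - \frac{r^{2}}{n}\right).
\end{aligned}$$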

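Now consider the orthogonal projection maps:

$$\Pi_i = \Pi_i(X) : z \in \mathbb{R}^{n} \mapsto z - \sum_{k=1}^{i-1}(X_k\cdot z)\,X_k,$$

with $\Pi_1$ the identity operator. Clearly, $0 < |\Pi_ie_i| \le 1$ for all $1 \le i \le \lceil n/2\rceil$
with probability 1. $X_i$ belongs to the range of $\Pi_i$, a self-adjoint operator.
Hence, outside of a null set, $X_i\cdot e_i = \Pi_iX_i\cdot e_i = X_i\cdot\Pi_ie_i$, so that

$$X_i\cdot e_i \ge 1 - \frac{r^{2}}{n} \;\Longrightarrow\; X_i\cdot\frac{\Pi_ie_i}{|\Pi_ie_i|} \ge 1 - \frac{r^{2}}{n}.$$

This implies the bound

$$\mathbb{P}\left(\forall 1\le i\le\lceil n/2\rceil,\ X_i\cdot e_i \ge 1 - \frac{r^{2}}{n}\right) \le \mathbb{P}\left(\bigcap_{i=1}^{\lceil n/2\rceil}E_i\right) \qquad \left[\text{with } E_i = \left\{X_i\cdot\frac{\Pi_ie_i}{|\Pi_ie_i|} \ge 1 - \frac{r^{2}}{n}\right\}\right].$$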

Let $\mathcal{F}_0 = \{\emptyset,\Omega\}$ be the trivial $\sigma$-field on $\Omega$ and, for $1 \le j \le n$, let $\mathcal{F}_j$ be the
$\sigma$-field generated by $X_1,\dots,X_j$. These $\sigma$-fields form an increasing sequence.
We omit the proof of the following three facts, valid for each $1 \le i \le \lceil n/2\rceil$:

1. $E_i$ is $\mathcal{F}_i$-measurable;

2. $\Pi_ie_i/|\Pi_ie_i|$ is $\mathcal{F}_{i-1}$-measurable;

3. conditioned on $\mathcal{F}_{i-1}$, $X_i$ is uniform on the $(n-i)$-dimensional unit sphere
of the subspace of $\mathbb{R}^{n}$ corresponding to the range of $\Pi_i$, and $\Pi_ie_i/|\Pi_ie_i|$
is a point on that same sphere.

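Let $V_{n-i}$ be the (normalized) uniform measure on $S^{n-i}$. The above considerations,
together with the rotational invariance of $V_{n-i}$, imply that

$$\forall 1\le i\le\lceil n/2\rceil,\quad \mathbb{P}(E_i \mid \mathcal{F}_{i-1}) = V_{n-i}(C_{n-i}(1 - r^{2}/n)) \quad\text{a.s.},$$

where, for a given $\tau \in \mathbb{R}$, $C_{n-i}(\tau)$ is the spherical cap

$$C_{n-i}(\tau) = \{v \in S^{n-i} : v\cdot e_1 \ge \tau\}.$$

A simple inductive argument with conditional expectations then shows that

$$\begin{aligned}
\mathbb{P}\left(\bigcap_{i=1}^{\lceil n/2\rceil}E_i\right) &= \mathbb{E}\left[\mathbb{P}(E_{\lceil n/2\rceil}\mid\mathcal{F}_{\lceil n/2\rceil-1})\,\chi_{\bigcap_{i=1}^{\lceil n/2\rceil-1}E_i}\right]\\
&= V_{n-\lceil n/2\rceil}(C_{n-\lceil n/2\rceil}(1 - r^{2}/n))\;\mathbb{P}\left(\bigcap_{i=1}^{\lceil n/2\rceil-1}E_i\right)\\
&= (\dots)\\
&= \prod_{i=1}^{\lceil n/2\rceil}V_{n-i}(C_{n-i}(1 - r^{2}/n)).
\end{aligned}$$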

We now apply known bounds on the volume of spherical caps [2], Lemma
2.1:

$\forall m \in \mathbb{N}\setminus\{0,1\}$, $\forall \tau \in [2/\sqrt{m},1]$,

(25) 

fl _ T 2\{m-l)/2 d_ 2\{m-l)/2 

<V m (C m (r))< { 



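We need the upper bound for $n - \lceil n/2\rceil \le m \le n-1$ and

$$\tau = 1 - \frac{r^{2}}{n}, \quad\text{which } \in [2/\sqrt{m},1]$$

for $n \ge 10$, $r \le \sqrt{n}/10$.
Moreover, we know that in this case $2\tau^{2} \ge 2 - 4r^{2}/n \ge 1$, so

$$\mathbb{P}\left(\bigcap_{i=1}^{\lceil n/2\rceil}E_i\right) = \prod_{i=1}^{\lceil n/2\rceil}V_{n-i}(C_{n-i}(1 - r^{2}/n)) \le \left(\frac{e^{\phi_0}\,r}{\sqrt{n}}\right)^{\psi n^{2}}$$

for some constants $\phi_0, \psi > 0$ not depending on $n \ge 10$ or $r \le \sqrt{n}/10$. Using
(24), we deduce that

$$\mathcal{H}(B(\mathrm{id},r)) \le 2^{n}\,\mathbb{P}\left(\bigcap_{i=1}^{\lceil n/2\rceil}E_i\right) \le \left(\frac{e^{\phi}\,r}{\sqrt{n}}\right)^{\psi n^{2}},$$

with $\phi > 0$ another constant. The claim and the theorem are finally proven.
$\Box$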



7. Final remarks. 

• The most obvious problem left open in the present paper is a sharp characterization
of the mixing time of Kac's walk. We conjecture that our
upper bound is tight for all $\varepsilon \in (0,\varepsilon_0)$; that is, that there exist constants
$c, \varepsilon_0 > 0$ such that, for all $n \ge 3$ and $\varepsilon \in (0,\varepsilon_0)$,

$$\tau_{\mathrm{hs},1}(\varepsilon) \ge cn^{2}\ln n.$$
Notice that the restriction to $n \ge 3$ is necessary, as for $n = 2$ the walk
mixes perfectly in one single step.

The quantity $n^{2}\ln n$ in the conjectured lower bound immediately suggests
a "coupon-collector phenomenon." For instance, one is tempted to
guess that the walk cannot mix before 2-dimensional rotations have been
applied to all possible pairs $e_i, e_j$ of canonical basis vectors. The difficulty
with this idea is that two rows of $X(t)$ may "interact" without ever being
changed in the same step of the walk.

• The simple lower bound method in Section 6 cannot go farther than $\Omega(n^{2})$,
even if $\varepsilon \to 0$ with $n$. It would be interesting to derive better lower bounds
at this level of generality.

• Going back to the application of Ailon and Chazelle [1], $O(n^{2}\ln n)$ mixing
is still too large for $n$ big, which is precisely when dimensionality
reduction is the most useful. However, that application only requires that
certain projections behave as they should, which is a less stringent requirement
than approximating the Haar measure. It is thus natural to
ask whether better bounds might be available for that specific application.
More precisely, let $Y_k(t) = \Pi_kX(t)^{\dagger}$, where $X(t)$ is a realization of
Kac's walk and $\Pi_k$ is the projection onto the first $k$ canonical basis vectors.
Clearly, $\{Y_k(t)\}_{t=0}^{+\infty}$ corresponds to a Markov chain on the Stiefel
manifold:

$$V_k(\mathbb{R}^{n}) = \{(v_1,\dots,v_k) \in (\mathbb{R}^{n})^{k} : \forall\, 1\le i,j\le k,\ v_i\cdot v_j = \delta_{ij}\}.$$

One can adapt the proof of Theorem 2 to show that this walk cannot mix
in less than $\Omega(nk)$ time.

We conjecture that $Y_k(t)$ mixes in $\Theta(nk\ln n)$ steps. Recall that for
dimension reduction we need $k = O(\ln|S|)$. Our conjecture would imply
great time savings for $n \gg \ln|S|$.

• Theorem 3 on local-to-global coupling can be used to reprove some known
results. Consider, for instance, a Riemannian manifold $M$ with dimension
$n$, distance $d$ and Ricci curvature lower bounded by $K \in \mathbb{R}$. Let $P = P^{(\varepsilon)}$
correspond to the ball walk on $M$ where a step from $x$ consists of choosing
$X$ uniformly from the ball $B(x,\varepsilon)$. Using a simple, "strictly local" variant
of [17], Lemma 2, and our Theorem 3, one can very easily show that
$\mu \mapsto \mu P^{(\varepsilon)}$ is $(1 - K\varepsilon^{2}/2(n+2) + o(\varepsilon^{2}))$-Lipschitz (thus contracting when
$K > 0$ and $\varepsilon$ is small enough). By "strictly local," we mean that we do
not need to have control of $W_{d,p}(P_x, P_y)$ uniformly over all pairs of nearby
points in the manifold: we just need that for each fixed $x \in M$, as $y \to x$,

$$W_{d,p}(P^{(\varepsilon)}_x, P^{(\varepsilon)}_y) \le (\xi + o(1))\,d(x,y)$$

for the appropriate $\xi > 0$.

We expect that checking the local Lipschitz condition in other applications
will oftentimes be much simpler than proving a global contraction
estimate.
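The coupon-collector heuristic from the first remark can be quantified. The Python sketch below (ours, for illustration only) computes the classical expected number of uniform draws of a pair $i < j$ needed to see all $\binom{n}{2}$ pairs, namely $N H_N$ with $N = \binom{n}{2}$ and $H_N$ the $N$th harmonic number, and compares it with $n^{2}\ln n$. It says nothing about the actual mixing time of the walk.

```python
import math

def expected_steps_to_hit_all_pairs(n):
    """Coupon-collector expectation N * H_N with N = C(n, 2) pairs."""
    pairs = math.comb(n, 2)
    harmonic = sum(1.0 / k for k in range(1, pairs + 1))
    return pairs * harmonic

results = {}
for n in (10, 100, 1000):
    steps = expected_steps_to_hit_all_pairs(n)
    results[n] = steps
    # Ratio to n^2 ln n; it approaches 1 slowly as n grows.
    print(n, round(steps), steps / (n * n * math.log(n)))
```

Since $N\ln N \approx n^{2}\ln n$ for $N = \binom{n}{2}$, the time needed to touch every pair of coordinates is of the same order as the conjectured mixing time.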